Abstract

This paper deals with the classical problem of density estimation on the real line. Most of the 
existing papers devoted to minimax properties assume that the support of the underlying density 
is bounded and known. But this assumption may be very difficult to handle in practice. In this 
work, we show that, exactly as a curse of dimensionality exists when the data lie in $\mathbb{R}^d$, there exists
a curse of support as well when the support of the density is infinite. As for the dimensionality 
problem where the rates of convergence deteriorate when the dimension grows, the minimax rates of 
convergence may deteriorate as well when the support becomes infinite. This problem is not purely 
theoretical since the simulations show that the support-dependent methods are really affected in 
practice by the size of the density support, or by the weight of the density tail. We propose a 
method based on a biorthogonal wavelet thresholding rule that is adaptive with respect to the 
nature of the support and the regularity of the signal, but that is also robust in practice to this 
curse of support. The threshold proposed here is very accurately calibrated so that the
gap between optimal theoretical and practical tuning parameters is almost filled.

Keywords Density estimation, Wavelet, Thresholding rule, infinite support. 
1 Introduction

This paper deals with the classical problem of density estimation for unidimensional data. Our aim is 
to provide an adaptive method which requires as few assumptions as possible on the underlying density 
in order to apply it in an exploratory way. In particular, we do not want to have any assumption on 
the density support. Moreover this method should be quite easy to implement and should have good 
theoretical performance as well. 

Density estimation is a task that lies at the core of many data preprocessing procedures. From this
point of view, no assumption should be made on the underlying function to estimate. Without giving a
full survey of the subject, let us describe classical methods of the literature.

At least in a first approach, histograms or kernel methods are often used. The main problem is
to choose the bandwidth (see for instance Silverman (1978)), which is usually performed by
cross-validation (see the fundamental paper by Rudemo (1982)). There are no clear theoretical results about
the performance of this cross-validation method from an adaptive minimax point of view. However,
methodologies based on kernel methods are the most widespread in practice. Consequently, several
data-driven methods have been developed (see Silverman (1986) for a good review). There exist
fundamentally two ways to extend cross-validation. The first one is to obtain more competitive
methods from the computational point of view (see Gray and Moore (2003)). The second one is to
find more robust and less undersmoothing methods (see Jones et al. (1996) for a recent survey).

All these methods suffer from a lack of spatial adaptivity since the bandwidth is selected uniformly
in space. To improve this point, Sain and Scott (1996) have suggested a practical kernel method which
makes the choice of the bandwidth more local, this algorithm being still based on intensive
cross-validation. None of these methods requires in practice the preliminary knowledge of the support,
but none provides theoretical guarantees from the minimax point of view. On the contrary, in the white
noise model, under assumptions on the underlying signal and its support, it is possible to select the
best possible local bandwidth in the adaptive minimax setting via the Lepski method (see Lepski et al.
(1993) for instance).

The Lepski method is closely related to model selection methods. Following Akaike's criterion
for histograms, Castellan (2000) has derived adaptive minimax procedures for density estimation (see
[...] for details and Birgé and [...] for a practical point of view). To
remedy the lack of smoothness of histograms, piecewise polynomial estimates can also be used (see for
instance Castellan (2003) and Rozenholc [...] for the corresponding software, Willett and Nowak
(2007) or Koo et al. (1999) for the spline basis). It is worth emphasizing that the necessary input of all
these methods is the support of the underlying density, classically assumed to be [0, 1]. We can also cite
the results based on $\ell_1$ penalties: see Bunea et al. (2007), Bunea et al. ([...]) and Bertin et al. (2009),
who derived oracle inequalities for which no assumptions on the support are made. However, minimax
optimality is not investigated in these papers and, for simulations, Bertin et al. (2009) considered
signals supported by [0, 1]. So, whether the support plays a key role for $\ell_1$ methodologies remains an
open question, even if we naturally conjecture that the answer is yes. In practice, the data are usually
rescaled by the smallest and largest observations before performing any of the previous algorithms.
This preprocessing has not been studied theoretically. In particular, what happens if the density is
heavy-tailed?

Now let us turn to wavelet thresholding. Donoho et al. (1996) first provided theoretical
adaptive minimax results in the density setting. This paper is a theoretical benchmark but their
threshold depends on the extra knowledge of the infinite norm of the underlying density. In practice,
even if this quantity is known, this choice is often too conservative. From a computational point of
view, the DWT algorithm due to Mallat (1989) combined with a keep or kill rule on each coefficient
makes these methods among the easiest adaptive methods to implement, once the threshold is known.
Here lies the fundamental problem: after rescaling and binning the data as in Antoniadis et al. (1999)
for instance, one can reasonably think that the number of observations in a "not too small" interval is
Gaussian, up to some possible transformation (see Brown et al. (2007)). So basically the thresholding
rules adapted to the Gaussian regression setting should work here (we refer the reader to the very
complete review paper of Antoniadis et al. (2001) which provides descriptions and comparisons of
various wavelet shrinkage and thresholding estimators in the regression setting). Of course many
assumptions are required. Even if theoretical justifications are given in Brown et al. (2007), the
method still relies heavily on the precise knowledge of the support, which is directly linked to the size
of the bins. In their seminal work, Herrick et al. (2001) have already observed that in practice the
basic Gaussian approximation for general wavelet bases was quite poor. This can be corrected by
the use of the Haar basis and accurate thresholding rules but the reconstructions are consequently
piecewise constant. Note also that in Herrick et al. (2001) no assumption was made on the possible
support of the underlying density. More recently, Juditsky and Lambert-Lacroix (2004) have proposed
an adaptive thresholding procedure on the whole real line. Their threshold is not based on a direct
Gaussian approximation. Indeed, the chosen threshold depends randomly on the localization in time
and frequency of the coefficient that has to be kept or killed. They derive adaptive minimax results for
Hölderian spaces, exhibiting rates that are different from the bounded support case. However there is
a gap between their optimal theoretical and practical tuning parameters of the threshold.

While the main goal of this paper is to investigate assumption-free wavelet thresholding methodologies
as explained in the first paragraph, we also aim at filling this gap by designing a new threshold
depending on a tuning parameter $\gamma$: the precise form of the threshold is closely related to sharp
exponential inequalities for iid variables, avoiding the use of Gaussian approximation. Unlike the methods
of Juditsky and Lambert-Lacroix (2004) and Herrick et al. (2001), all the coefficients (and in particular
the coarsest ones) are likely to be thresholded. Moreover, since our threshold is defined very accurately
from a non-asymptotic point of view, we obtain sharp oracle inequalities for $\gamma > 1$. But we also prove
that taking $\gamma < 1$ deteriorates the theoretical properties of our estimators. Hence the remaining
gap between theoretical and practical thresholds lies in a second order term (see Section 2 for more
details). The construction of our estimators and the previous results are stated in Section 2. Next, in
Section 3, we illustrate the impact of the bounded support assumption by exhibiting minimax rates of
convergence on the whole class of Besov spaces, extending the results of Juditsky and Lambert-Lacroix
(2004). In particular, when the support is infinite, our results reveal how minimax rates deteriorate
according to the sparsity of the density. We also show that our estimator is adaptive minimax (up
to a logarithmic term) over Besov balls with respect to the regularity but also with respect to the
support (finite or not). In Section 4, we investigate the curse of support for the most well-known
support-dependent methods and compare them with our method and with the cross-validated kernel
method. Our method, which is naturally spatially adaptive, seems to be robust with respect to the
size of the support or the tail of the underlying density. We also implement our method on real data,
revealing the potential impact of our methodology for practitioners. The appendices are dedicated to
an analytical description of the biorthogonal wavelet basis but also to the proofs of the main results.

2 Our method 

Let us observe an n-sample of density $f$ assumed to be in $L_2(\mathbb{R})$. We denote this sample $X_1, \dots, X_n$.
We estimate $f$ via its coefficients on a special biorthogonal wavelet basis, due to Cohen et al. (1992).
The decomposition of $f$ on such a basis takes the following form:

$$f = \sum_{k\in\mathbb{Z}} \beta_{-1k}\tilde\psi_{-1k} + \sum_{j\ge 0}\sum_{k\in\mathbb{Z}} \beta_{jk}\tilde\psi_{jk}, \qquad (2.1)$$

where for any $j \ge 0$ and any $k \in \mathbb{Z}$,

$$\beta_{-1k} = \int f(x)\psi_{-1k}(x)\,dx, \qquad \beta_{jk} = \int f(x)\psi_{jk}(x)\,dx.$$

The most basic example of biorthogonal wavelet basis is the Haar basis where the father wavelets are
given by

$$\forall k\in\mathbb{Z},\quad \psi_{-1k} = \tilde\psi_{-1k} = \mathbf{1}_{[k;k+1]}$$

and the mother wavelets are given by

$$\forall j\ge 0,\ \forall k\in\mathbb{Z},\quad \psi_{jk} = \tilde\psi_{jk} = 2^{j/2}\left(\mathbf{1}_{[k2^{-j};(k+1/2)2^{-j})} - \mathbf{1}_{[(k+1/2)2^{-j};(k+1)2^{-j}]}\right).$$

The other examples we consider are more precisely described in Appendix A. The essential feature
is that it is possible to use, on one hand, decomposition wavelets $\psi_{jk}$ that are piecewise constant,
and, on the other hand, smooth reconstruction wavelets $\tilde\psi_{jk}$. In particular, except for the Haar basis,
decomposition and reconstruction wavelets are different. To shorten mathematical expressions, we set

$$\Lambda = \{(j,k):\ j\ge -1,\ k\in\mathbb{Z}\} \qquad (2.2)$$

and (2.1) can be rewritten as

$$f = \sum_{(j,k)\in\Lambda} \beta_{jk}\tilde\psi_{jk} \quad\text{with}\quad \beta_{jk} = \int \psi_{jk}(x)f(x)\,dx. \qquad (2.3)$$

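A classical unbiased estimator for $\beta_{jk}$ is the empirical coefficient

$$\hat\beta_{jk} = \frac{1}{n}\sum_{i=1}^{n}\psi_{jk}(X_i), \qquad (2.4)$$

whose variance is $\sigma^2_{jk}/n$ where

$$\sigma^2_{jk} = \int \psi^2_{jk}(x)f(x)\,dx - \left(\int \psi_{jk}(x)f(x)\,dx\right)^2.$$

Note that $\sigma^2_{jk}$ is classically unbiasedly estimated by $\hat\sigma^2_{jk}$ with

$$\hat\sigma^2_{jk} = \frac{1}{n(n-1)}\sum_{i=2}^{n}\sum_{l=1}^{i-1}\left(\psi_{jk}(X_i)-\psi_{jk}(X_l)\right)^2.$$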
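As an illustration of the empirical coefficient (2.4) and of the variance estimate, here is a minimal sketch in the Haar basis. The function names are hypothetical, and the variance is assumed to be the pairwise form, i.e. the sum of squared differences $(\psi_{jk}(X_i)-\psi_{jk}(X_l))^2$ over pairs $l < i$ divided by $n(n-1)$, which coincides with the usual sample variance with denominator $n-1$.

```python
def haar_psi(j, k, x):
    """Haar wavelet psi_jk at x; j = -1 gives the father wavelet 1_[k, k+1]."""
    if j == -1:
        return 1.0 if k <= x <= k + 1 else 0.0
    lo = k * 2.0 ** -j
    mid = (k + 0.5) * 2.0 ** -j
    hi = (k + 1) * 2.0 ** -j
    if lo <= x < mid:
        return 2.0 ** (j / 2)
    if mid <= x <= hi:
        return -(2.0 ** (j / 2))
    return 0.0


def empirical_coefficient(j, k, sample):
    """Empirical coefficient (2.4): average of psi_jk over the sample."""
    return sum(haar_psi(j, k, x) for x in sample) / len(sample)


def variance_estimate(j, k, sample):
    """Pairwise (U-statistic) estimate of sigma_jk^2, assumed form (see lead-in)."""
    vals = [haar_psi(j, k, x) for x in sample]
    n = len(vals)
    total = 0.0
    for i in range(1, n):
        for l in range(i):
            total += (vals[i] - vals[l]) ** 2
    return total / (n * (n - 1))
```

The double loop costs $O(n^2)$ operations; in practice the equivalent sample-variance formula would be preferred, the pairwise form being kept here only to mirror the displayed definition.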

Now, let us define our thresholding estimate of $f$. In the sequel there are two different kinds of steps,
depending on whether the estimate is used for theoretical or practical purposes. Both situations are
respectively denoted 'Th.' and 'Prac.'.

Step 0

Th. Choose a constant $c \ge 1$, a real number $c'$ and let $j_0$ be such that $j_0 = \lfloor \log_2([n^{c}(\log n)^{c'}]) \rfloor$.
Choose also a positive constant $\gamma$.

Prac. Let $j_0 = \lfloor \log_2(n) \rfloor$.

Step 1 Set $\Gamma_n = \{(j,k):\ -1 \le j \le j_0,\ k\in\mathbb{Z}\}$ and compute for any $(j,k)\in\Gamma_n$ the non-zero empirical
coefficients $\hat\beta_{jk}$ (whose number is almost surely finite).

Step 2 Threshold the coefficients by setting $\tilde\beta_{jk} = \hat\beta_{jk}\mathbf{1}_{|\hat\beta_{jk}|\ge\eta_{jk}}$ according to the following threshold
choice.

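Th. Overestimate slightly the variance $\sigma^2_{jk}$ by

$$\tilde\sigma^2_{jk} = \hat\sigma^2_{jk} + 2\|\psi_{jk}\|_\infty\sqrt{2\gamma\hat\sigma^2_{jk}\frac{\log n}{n}} + 8\gamma\|\psi_{jk}\|^2_\infty\frac{\log n}{n}$$

and choose

$$\eta_{jk} = \eta^{Th}_{jk,\gamma} = \sqrt{2\gamma\tilde\sigma^2_{jk}\frac{\log n}{n}} + \frac{2\|\psi_{jk}\|_\infty\gamma\log n}{3n}. \qquad (2.5)$$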
Prac. Estimate unbiasedly the variance by $\hat\sigma^2_{jk}$ and choose

$$\eta_{jk} = \eta^{Prac}_{jk} = \sqrt{2\hat\sigma^2_{jk}\frac{\log n}{n}} + \frac{2\|\psi_{jk}\|_\infty\log n}{3n}. \qquad (2.6)$$

Step 3 Reconstruct the function by using the $\tilde\beta_{jk}$'s and denote

Th.

$$\tilde f_{n,\gamma} = \sum_{(j,k)\in\Gamma_n}\tilde\beta_{jk}\tilde\psi_{jk} \qquad (2.7)$$

Prac.

$$\tilde f^{Prac}_n = \left(\sum_{(j,k)\in\Gamma_n}\tilde\beta_{jk}\tilde\psi_{jk}\right)_+ \qquad (2.8)$$
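The practical steps above can be sketched as follows in the Haar basis, where decomposition and reconstruction wavelets coincide. This is an illustrative sketch rather than the authors' implementation: the names `haar_value` and `practical_estimate` are hypothetical, half-open intervals are assumed for the Haar supports (which only differs from the closed-interval definition on a null set), and the parameter `gamma` (default 1, which gives the practical threshold) is included for convenience.

```python
import math
from collections import defaultdict


def haar_value(j, k, x):
    """Haar wavelet psi_jk(x) on half-open supports (assumption); j = -1 is the father."""
    if j == -1:
        return 1.0 if k <= x < k + 1 else 0.0
    scale = 2.0 ** j
    u = x * scale - k
    if 0.0 <= u < 0.5:
        return math.sqrt(scale)
    if 0.5 <= u < 1.0:
        return -math.sqrt(scale)
    return 0.0


def practical_estimate(sample, gamma=1.0):
    """Return the kept coefficients and the thresholded Haar density estimate."""
    n = len(sample)
    j0 = int(math.floor(math.log2(n)))       # Step 0 (Prac.)
    log_n = math.log(n)
    sums = defaultdict(float)                 # sum_i psi_jk(X_i)
    sq_sums = defaultdict(float)              # sum_i psi_jk(X_i)^2
    for x in sample:                          # Step 1: only non-zero coefficients
        for j in range(-1, j0 + 1):
            k = math.floor(x) if j == -1 else math.floor(x * 2.0 ** j)
            v = haar_value(j, k, x)
            sums[(j, k)] += v
            sq_sums[(j, k)] += v * v
    kept = {}
    for (j, k), s in sums.items():            # Step 2: keep or kill
        beta_hat = s / n
        # sample variance with denominator n - 1, clipped against rounding errors
        var_hat = max((sq_sums[(j, k)] - n * beta_hat ** 2) / (n - 1), 0.0)
        sup_norm = 1.0 if j == -1 else 2.0 ** (j / 2)
        eta = (math.sqrt(2 * gamma * var_hat * log_n / n)
               + 2 * gamma * sup_norm * log_n / (3 * n))
        if abs(beta_hat) >= eta:
            kept[(j, k)] = beta_hat

    def f_hat(x):                             # Step 3: reconstruction, positive part
        total = sum(b * haar_value(j, k, x) for (j, k), b in kept.items())
        return max(total, 0.0)

    return kept, f_hat
```

Only the indices $(j,k)$ whose support contains at least one observation are visited, so the cost is of order $n \log n$ before reconstruction.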

Note that this method can easily be implemented with a low computational cost. In particular,
unlike the DWT-based algorithms, our algorithm does not need numerical approximations, except at
Step 3 for the computations of the $\tilde\psi_{jk}$ (unless we use the Haar basis). However, a preprocessing,
independent of the algorithm, can be used to compute reconstruction wavelets at any required
precision. Both practical and theoretical thresholds are based on the following heuristics. Let $c_0 > 0$.
Define the heavy mass zone as the set of indices $(j,k)\in\Lambda$ such that $f(x)\ge c_0$ for $x$ in the support of
$\psi_{jk}$ and $\|\psi_{jk}\|^2_\infty = o_n(n(\log n)^{-1})$. In this heavy mass zone, the random term of (2.5) or (2.6) is the
main one and we asymptotically derive that with large probability

$$\eta^{Th}_{jk,\gamma} \approx \sqrt{2\gamma\tilde\sigma^2_{jk}\frac{\log n}{n}} \quad\text{and}\quad \eta^{Prac}_{jk} \approx \sqrt{2\hat\sigma^2_{jk}\frac{\log n}{n}}. \qquad (2.9)$$



The shape of the right-hand terms in (2.9) is classical in the density estimation framework (see
Donoho et al. (1996)). In fact, they look like the threshold proposed by Juditsky and Lambert-Lacroix
(2004) or the universal threshold $\eta^U$ proposed by Donoho and Johnstone (1994) in the Gaussian
regression framework. Indeed, we recall that, in this set-up,

$$\eta^U = \sqrt{2\sigma^2\log n},$$

where $\sigma^2$ (assumed to be known in the Gaussian framework) is the variance of each noisy wavelet
coefficient. Actually, the deterministic term of (2.5) (or (2.6)) constitutes the main difference with the
threshold proposed by Juditsky and Lambert-Lacroix (2004): it replaces the second keep or kill rule
applied by Juditsky and Lambert-Lacroix on the empirical coefficients. This additional term allows one to
control large deviation terms for high resolution levels. It is directly linked to Bernstein's inequality
(see the proofs in Appendix B). The forthcoming oracle inequality (Theorem 1) holds with (2.5) for
any $\gamma > 1$: this is essential to fill the gap between theory and practice. Indeed, note that if one
takes $c = \gamma = 1$ and $c' = 0$ then the main difference between (2.5) and (2.6) is that a second order term
exists in the estimation of $\sigma^2_{jk}$ by $\tilde\sigma^2_{jk}$. But the main part is exactly the same: when the coefficient
lies in the heavy mass zone and when $\gamma$ tends to 1, $\eta^{Th}_{jk,\gamma}$ tends to $\eta^{Prac}_{jk}$ with high probability. Indeed,
one can note that for all $\varepsilon > 0$ and $\gamma > 1$,



n 
As often suggested in the literature, instead of estimating $\mathrm{Var}(\hat\beta_{jk})$, we could have used the
inequality

$$\mathrm{Var}(\hat\beta_{jk}) = \frac{\sigma^2_{jk}}{n} \le \frac{\|f\|_\infty}{n}$$

and we could have replaced $\tilde\sigma^2_{jk}$ with $\|f\|_\infty$ in the definition of the threshold. But this requires a
strong assumption: $f$ is bounded and $\|f\|_\infty$ is known. In our paper, $\mathrm{Var}(\hat\beta_{jk})$ is accurately estimated,
making those conditions unnecessary. Theoretically, we slightly overestimate $\sigma^2_{jk}$ to control large
deviation terms and this is the reason why we introduce $\tilde\sigma^2_{jk}$. Note that Reynaud-Bouret and Rivoirard
(2009) have proposed thresholding rules based on similar heuristic arguments in the Poisson intensity
estimation framework. But proofs and computations are more involved for density estimation because
sharp upper and lower bounds for $\tilde\sigma^2_{jk}$ are more intricate.

For practical purposes, $\eta^{Th}_{jk,\gamma}$ (even with $\gamma = 1$) slightly oversmooths the estimate with respect to
$\eta^{Prac}_{jk}$. From a simulation point of view, the linear term $\frac{2\|\psi_{jk}\|_\infty\log n}{3n}$ in $\eta^{Prac}_{jk}$ with the precise constant
2/3 seems to be accurate.

The remaining part of this section is dedicated to a precise choice of $\gamma$, first from an oracle point
of view, next from a theoretical and practical study.

2.1 Oracle inequalities 

The oracle point of view has been introduced by Donoho and Johnstone (1994). In this approach, an
estimate is optimal if it can essentially mimic the performance of the "oracle estimator". Let us recall
that the latter is not a true estimator since it depends on the function to be estimated but it represents
an ideal for a particular method (namely, here, wavelet thresholding). So, in our framework, the oracle
provides the noisy wavelet coefficients that have to be kept. It is easy to see that the "oracle estimate"
is

$$\bar f_n = \sum_{(j,k)\in\Gamma_n}\bar\beta_{jk}\tilde\psi_{jk},$$

where $\bar\beta_{jk} = \hat\beta_{jk}\mathbf{1}_{\beta^2_{jk} > \sigma^2_{jk}/n}$ satisfies

$$\mathbb{E}\left[(\bar\beta_{jk}-\beta_{jk})^2\right] = \min\left(\beta^2_{jk}, \frac{\sigma^2_{jk}}{n}\right).$$

By keeping the coefficients $\hat\beta_{jk}$ larger than the thresholds defined in (2.5), our estimator has a risk
that is not larger than the oracle risk, up to a logarithmic term, as stated by the following result.

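Theorem 1. Let us consider a biorthogonal wavelet basis satisfying the properties described in
Appendix A. If $\gamma > c$, then $\tilde f_{n,\gamma}$ satisfies the following oracle inequality: for n large enough

$$\mathbb{E}\left[\|\tilde f_{n,\gamma}-f\|_2^2\right] \le C_1\left[\sum_{(j,k)\in\Gamma_n}\min\left(\beta^2_{jk}, \log n\,\frac{\sigma^2_{jk}}{n}\right) + \sum_{(j,k)\notin\Gamma_n}\beta^2_{jk}\right] + \frac{C_2\log n}{n} \qquad (2.10)$$

where $C_1$ is a positive constant depending only on $\gamma$, $c$ and the choice of the wavelet basis and where
$C_2$ is also a positive constant depending on $\gamma$, $c$, $c'$, $\|f\|_2$ and the choice of the wavelet basis.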

Note that Theorem 1 holds with $c = 1$ and $\gamma > 1$, as announced. Following the oracle point of
view of Donoho and Johnstone, Theorem 1 shows that our procedure is optimal up to the logarithmic
factor (and the negligible term $\log n/n$). This logarithmic term is in some sense unavoidable. It is the
price we pay for adaptivity, i.e. the fact that we do not know the coefficients to keep. Note also that
our result is true provided $f \in L_2(\mathbb{R})$. So, assumptions on $f$ are very mild here. This is not the case
for most of the results for non-parametric estimation procedures where one assumes that $\|f\|_\infty < \infty$
and that $f$ has a compact support. Note in addition that this support and $\|f\|_\infty$ are often assumed
to be known in the literature. On the contrary, in Theorem 1, $f$ and its support can be unbounded. So,
we make as few assumptions as possible. This is allowed by considering random thresholding with the
data-driven thresholds defined in (2.5).



2.2 Calibration issues 

We address the problem of choosing conveniently the threshold parameter $\gamma$ from the theoretical
point of view. The aim and the proofs are inspired by Birgé and Massart (2007) who considered
penalized estimators and calibrated constants for penalties in a Gaussian framework. In particular,
they showed that if the penalty constant is smaller than 1, then the penalized estimator behaves in a
quite unsatisfactory way. This study was used in practice to derive adequate data-driven penalties by
Lebarbier (2005).

According to Theorem 1, we notice that for any signal, taking $c = 1$ and $c' = 0$, we achieve the
oracle performance up to a logarithmic term provided $\gamma > 1$. So, our primary interest is to wonder
what happens, from the theoretical point of view, when $\gamma < 1$.

To handle this problem, we consider the simplest signal in our setting and we compare the rates
of convergence when $\gamma > 1$ and $\gamma < 1$.

Theorem 2. Let $f = \mathbf{1}_{[0,1]}$ and let us consider $\tilde f_{n,\gamma}$ with the Haar basis, $c = 1$ and $c' = 0$.
• If $\gamma > 1$ then there exists a constant $C$ depending only on $\gamma$ such that

$$\mathbb{E}\left(\|\tilde f_{n,\gamma}-f\|_2^2\right) \le C\,\frac{\log n}{n}.$$

• If $\gamma < 1$, then there exists $\delta < 1$ depending only on $\gamma$ such that

$$\mathbb{E}\left(\|\tilde f_{n,\gamma}-f\|_2^2\right) \ge \frac{1}{n^{\delta}}(1+o_n(1)).$$

Theorem 2 establishes that, asymptotically, $\tilde f_{n,\gamma}$ with $\gamma < 1$ cannot estimate a very simple signal
($f = \mathbf{1}_{[0,1]}$) at a convenient rate of convergence. This provides a lower bound for the threshold
parameter $\gamma$: we have to take $\gamma \ge 1$.

We reinforce these results by a simulation study. First we simulate 1000 n-samples of density
$f = \mathbf{1}_{[0,1]}$. We estimate $f$ by $\tilde f^{Prac}_n$ using the Haar basis, but to see the influence of the parameter $\gamma$
on the estimation, we replace $\eta^{Prac}_{jk}$ (see Step 2, (2.6)) by

$$\eta_{jk} = \sqrt{2\gamma\hat\sigma^2_{jk}\frac{\log n}{n}} + \frac{2\gamma\|\psi_{jk}\|_\infty\log n}{3n}. \qquad (2.11)$$
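A reduced-size version of this experiment can be sketched as follows. It is not the code used for Figure 1: the sample size (256) and number of replications (50) are deliberately smaller than in the study, the function names are hypothetical, half-open Haar supports are assumed, and the positive part of the estimate is skipped so that the squared $L_2$ error can be computed from the coefficients through the orthonormality of the Haar basis.

```python
import math
import random
from collections import defaultdict


def kept_haar_coefficients(sample, gamma):
    """Empirical Haar coefficients that survive the gamma-dependent threshold (2.11)."""
    n = len(sample)
    j0 = int(math.floor(math.log2(n)))
    log_n = math.log(n)
    sums, sq_sums = defaultdict(float), defaultdict(float)
    for x in sample:
        for j in range(-1, j0 + 1):
            if j == -1:
                k, v = math.floor(x), 1.0
            else:
                pos = x * 2.0 ** j
                k = math.floor(pos)
                v = 2.0 ** (j / 2) if pos - k < 0.5 else -(2.0 ** (j / 2))
            sums[(j, k)] += v
            sq_sums[(j, k)] += v * v
    kept = {}
    for (j, k), s in sums.items():
        beta_hat = s / n
        var_hat = max((sq_sums[(j, k)] - n * beta_hat ** 2) / (n - 1), 0.0)
        sup_norm = 1.0 if j == -1 else 2.0 ** (j / 2)
        eta = (math.sqrt(2 * gamma * var_hat * log_n / n)
               + 2 * gamma * sup_norm * log_n / (3 * n))
        if abs(beta_hat) >= eta:
            kept[(j, k)] = beta_hat
    return kept


def n_times_mise_uniform(gamma, n=256, replications=50, seed=0):
    """Monte Carlo estimate of n x MISE for the uniform density on [0, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replications):
        sample = [rng.random() for _ in range(n)]
        kept = kept_haar_coefficients(sample, gamma)
        # True coefficients of 1_[0,1]: beta_{-1,0} = 1, all the others vanish.
        ise = sum((b - (1.0 if jk == (-1, 0) else 0.0)) ** 2
                  for jk, b in kept.items())
        if (-1, 0) not in kept:
            ise += 1.0
        total += ise
    return n * total / replications
```

Calling `n_times_mise_uniform` on a grid of values of `gamma` gives a rough analogue of the curve (U); since every observation lies in [0, 1), the father coefficient equals 1 exactly, so any error comes from mother coefficients that pass the threshold.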

For any $\gamma$, we have computed $MISE_n(\gamma)$, i.e. the average over the 1000 simulations of $\|\tilde f^{Prac}_n - f\|_2^2$. On
the left part of Figure 1 (U), $MISE_n(\gamma) \times n$ is plotted as a function of $\gamma$ for different values of n. Note
that when $\gamma > 1$, $MISE_n(\gamma)$ is null, meaning that our procedure selects just one wavelet coefficient,
the one associated to $\psi_{-1,0} = \mathbf{1}_{[0,1]}$; all others are equal to zero. This fact remains true for a very large
range of values of $\gamma$. This plateau phenomenon has already been noticed in the Poisson framework
(see Reynaud-Bouret and Rivoirard (2009)). However as soon as $\gamma < 1$, $MISE_n(\gamma) \times n$ is positive
and increases when $\gamma$ decreases. It also increases with n, tending to prove that $MISE_n(\gamma) \gg 1/n$ for
$\gamma < 1$. This is in complete agreement with Theorem 2. Remark that, from a theoretical point of view,
the proof of part 2 of Theorem 2 holds for any choice of threshold that is asymptotically equivalent to
$\sqrt{2\gamma\sigma^2_{jk}\log n/n}$ in the heavy mass zone and in particular for the choice (2.11). From a numerical point of
view, the left part of Figure 1 (U) would have been essentially the same with $\eta^{Th}_{jk,\gamma}$, i.e. (2.5) instead
of (2.11). The reason why we used (2.11) is the practical performance when the function $f$ is more
irregular with respect to the chosen basis. Indeed we consider two other density functions $f$. The first
one is the density of a Gaussian variable whose results appear in the middle part of Figure 1 (G) and
the second one is the renormalized Bumps signal (see footnote 1) whose results appear in the right part
of Figure 1 (B). In both cases we computed $\tilde f^{Prac}_n$ with the Spline basis: this basis is a particular possible
choice of the wavelet basis which leads to smooth estimates. A description is available in Figure 9 of
Appendix A. We computed the associated $MISE_n(\gamma)$ over 100 simulations. Note that for the Bumps
signal, there is no plateau phenomenon and that the best choice for $\gamma$ is $\gamma = 0.5$ as soon as the highest
level of resolution, $j_0(n)$, is high enough to capture the irregularity of the signal. If n is too small, the
best choice is to keep all the coefficients. As already noticed in Reynaud-Bouret and Rivoirard (2009),
there exist in fact two behaviors: either the oracle $\bar f_n$ is close to $f$ and the best possible choice is
$\gamma \approx 1$ with a plateau phenomenon, or the oracle $\bar f_n$ is far from $f$ and it is better to take a smaller $\gamma$
(for instance $\gamma = 0.5$). The Gaussian density (G) exhibits both behaviors. For large n ($n \ge 1024$),
there is a plateau phenomenon around $\gamma = 1$. But for smaller n, the oracle $\bar f_n$ is not accurate enough
and taking $\gamma = 0.5$ is better. Note finally that the choice $\gamma = 1$, leading to our practical method,
namely $\tilde f^{Prac}_n$, is the more robust with respect to both situations.

Figure 1: $n \times MISE_n(\gamma)$ for (U) $f = \mathbf{1}_{[0,1]}$ (the Haar basis is used); (G) $f$ Gaussian density with
mean 0.5 and standard deviation 0.25 (the Spline basis is used); (B) $f$ is the renormalized Bumps
signal (the Spline basis is used)

1 The renormalized Bumps signal is a very irregular signal that is classically used in wavelet analysis. It is here
renormalized so that the integral equals 1 and it can be defined by

$$\left[\sum_j g_j\left(1+\frac{|x-p_j|}{w_j}\right)^{-4}\right]\frac{\mathbf{1}_{[0,1]}(x)}{0.284}$$

with

p = [ 0.1 0.13 0.15 0.23 0.25 0.4 0.44 0.65 0.76 0.78 0.81 ]
g = [ 4 5 3 4 5 4.2 2.1 4.3 3.1 5.1 4.2 ]
w = [ 0.005 0.005 0.006 0.01 0.01 0.03 0.01 0.01 0.005 0.008 0.005 ]



3 The curse of support from a minimax point of view 



The goal of this section is to derive the minimax rates on the whole class of Besov spaces. The
subsequent results will constitute generalizations of the results derived in Juditsky and Lambert-Lacroix
(2004) who pointed out minimax rates for density estimation on the class of Hölder spaces. For this
purpose, we consider the theoretical procedure $\tilde f_{n,\gamma}$ defined with the choice $c' = -c$ (see Step 0)
where the real number $c$ is chosen later. In some situations, it will be necessary to strengthen our
assumptions. More precisely, sometimes, we assume that $f$ is bounded. So, for any $R > 0$, we consider
the following set of functions:

$$\mathcal{L}_{2,\infty}(R) = \{f \text{ is a density such that } \|f\|_2 \le R \text{ and } \|f\|_\infty \le R\}.$$

The Besov balls we consider are classical (see Appendix A for a definition with respect to the
biorthogonal wavelet basis) and denoted $\mathcal{B}^\alpha_{p,q}(R)$. Let us just point out that no restriction is made on the support
of $f$ when $f$ belongs to $\mathcal{B}^\alpha_{p,q}(R)$: this support is potentially the whole real line. Now, let us state the
upper bound of the $L_2$-risk of $\tilde f_{n,\gamma}$.



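Theorem 3. Let $R, R' > 0$, $1 \le p, q \le \infty$ and $\alpha \in \mathbb{R}$ such that $\max\left(0, \frac{1}{p}-\frac{1}{2}\right) < \alpha < r+1$, where we
recall that $r$ ($r > 0$) denotes the wavelet smoothness parameter introduced in Appendix A. Let $c \ge 1$
such that

$$\alpha\left(1-\frac{1}{c(1+2\alpha)}\right) \ge \frac{1}{p}-\frac{1}{2} \qquad (3.1)$$

and $\gamma > c$. Then, there exists a constant $C$ depending on $R'$, $\gamma$, $c$, on the parameters of the Besov ball
and on the choice of the biorthogonal wavelet basis such that for any n,

if $p \le 2$,

$$\sup_{f\in\mathcal{B}^\alpha_{p,q}(R)\cap\mathcal{L}_{2,\infty}(R')} \mathbb{E}\left[\|\tilde f_{n,\gamma}-f\|_2^2\right] \le C\left(\frac{\log n}{n}\right)^{\frac{2\alpha}{2\alpha+1}}, \qquad (3.2)$$

if $p > 2$,

$$\sup_{f\in\mathcal{B}^\alpha_{p,q}(R)\cap\mathcal{L}_{2}(R')} \mathbb{E}\left[\|\tilde f_{n,\gamma}-f\|_2^2\right] \le C\left(\frac{\log n}{n}\right)^{\frac{\alpha}{\alpha+1-\frac{1}{p}}}. \qquad (3.3)$$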



First, let us briefly comment on the assumptions of these results. When $p > 2$, (3.1) is satisfied and the
result is true for any $c \ge 1$ and $0 < \alpha < r+1$. In addition, we do not need to restrict ourselves to
the set of bounded functions. When $p \le 2$, the result is true as soon as $c$ is large enough to satisfy
(3.1) and we establish (3.2) only for bounded functions. Actually, this assumption is in some sense
unavoidable as proved in Section 6.4 of Birgé (2008).
Furthermore, note that if we additionally assume that $f$ is bounded with a bounded support (say
[0, 1]) then $\mathbb{E}\|\tilde f_{n,\gamma}-f\|_2^2$ is always upper bounded by a constant times $(\log n/n)^{\frac{2\alpha}{2\alpha+1}}$ whatever $p$ is,
since, in this case, the assumption $f \in \mathcal{B}^\alpha_{p,\infty}(R)$ implies $f \in \mathcal{B}^\alpha_{2,\infty}(\tilde R)$ for $\tilde R$ large enough and $p > 2$.

Now, combining upper bounds (3.2) and (3.3), under the assumptions of Theorem 3, we point out the
following rate for our procedure when $f$ is bounded but without any assumption on the support:

$$\sup_{f\in\mathcal{B}^\alpha_{p,q}(R)\cap\mathcal{L}_{2,\infty}(R')} \mathbb{E}\left[\|\tilde f_{n,\gamma}-f\|_2^2\right] \le C\left(\frac{\log n}{n}\right)^{\frac{\alpha}{\alpha+\frac{1}{2}+\left(\frac{1}{2}-\frac{1}{p}\right)_+}}.$$



The following result derives lower bounds of the minimax risk showing that this rate is the optimal
rate up to a logarithmic term. So, the next result establishes the optimality properties of $\tilde f_{n,\gamma}$ under
the minimax approach.



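Theorem 4. Let $R, R' > 0$, $1 \le p, q \le \infty$ and $\alpha \in \mathbb{R}$ such that $\max\left(0, \frac{1}{p}-\frac{1}{2}\right) < \alpha < r+1$. Then,
there exists a positive constant $C$ depending on $R'$ and on the parameters of the Besov ball such that

$$\liminf_{n\to+\infty}\ n^{\frac{\alpha}{\alpha+\frac{1}{2}+\left(\frac{1}{2}-\frac{1}{p}\right)_+}}\ \inf_{\hat f}\ \sup_{f\in\mathcal{B}^\alpha_{p,q}(R)\cap\mathcal{L}_{2,\infty}(R')} \mathbb{E}\left[\|\hat f-f\|_2^2\right] \ge C,$$

where the infimum is taken over all the possible density estimators $\hat f$.
Furthermore, let $c \ge 1$, $p^* \ge 1$ and $\alpha^* > 0$ such that

$$\alpha^*\left(1-\frac{1}{c(1+2\alpha^*)}\right) \ge \frac{1}{p^*}-\frac{1}{2}. \qquad (3.4)$$

Then our procedure, $\tilde f_{n,\gamma}$, constructed with this precise choice of $c$ and $\gamma > c$, is adaptive minimax up
to a logarithmic term on

$$\left\{\mathcal{B}^\alpha_{p,q}(R)\cap\mathcal{L}_{2,\infty}(R'):\ \alpha^* \le \alpha < r+1,\ p^* \le p \le +\infty,\ 1 \le q \le \infty\right\}.$$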

When $p \le 2$, the lower bound for the minimax risk corresponds to the classical minimax rate
for estimating a compactly supported density (see Donoho et al. (1996)). In addition, the procedure
$\tilde f_{n,\gamma}$ achieves this minimax rate up to a logarithmic term. When $p > 2$, the risk deteriorates if no
assumption on the support is made, whereas it remains the same when we add the bounded support
assumption. Note that when $p = \infty$, the exponent becomes $\alpha/(1+\alpha)$: this rate was also derived in
Juditsky and Lambert-Lacroix (2004) for estimation on balls of $\mathcal{B}^\alpha_{\infty,\infty}$.

To summarize, we gather in Table 1 the lower bounds for the minimax rates obtained for each
situation. Those bounds are adaptively achieved by our estimator with respect to $p$, $\alpha$ and the
compactness of the support, up to a logarithmic term. While the logarithmic term is known to be unnecessary
in the bounded support case, the question remains open in the other case.
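The exponents gathered in Table 1 can be tabulated with a small helper. This is only an illustrative sketch with a hypothetical function name; it encodes the exponent $\alpha/(\alpha+1/2+(1/2-1/p)_+)$ for the case without support assumption and $2\alpha/(2\alpha+1)$ for the compactly supported case, logarithmic factors being ignored.

```python
import math


def rate_exponent(alpha, p, compact_support=False):
    """Exponent e such that the minimax rate is n**(-e), up to logarithmic terms."""
    if compact_support:
        return 2 * alpha / (2 * alpha + 1)
    # (1/2 - 1/p)_+ vanishes when p <= 2, so both cases then coincide
    penalty = max(0.5 - 1.0 / p, 0.0)
    return alpha / (alpha + 0.5 + penalty)


if __name__ == "__main__":
    for p in (1, 2, 4, math.inf):
        print(p, rate_exponent(1.0, p), rate_exponent(1.0, p, compact_support=True))
```

For a fixed regularity $\alpha$, the exponent without support assumption decreases from $2\alpha/(2\alpha+1)$ to $\alpha/(1+\alpha)$ as $p$ goes from 2 to infinity, which is the elbow discussed in the text.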

Our results show the role played by the support of the functions to be estimated on minimax
rates. As already observed, when $p \le 2$, the support has no influence since the rate exponent remains
unchanged whatever the size of the support (finite or not). Roughly speaking, it means that it is not
harder to estimate bounded non-compactly supported functions than bounded compactly supported
functions from the minimax point of view. It is not the case when $p > 2$. Actually, we note an
elbow phenomenon at $p = 2$ and the rate deteriorates when $p$ increases: this illustrates the curse of
support from a minimax point of view. Let us give an interpretation of this observation. Johnstone
(1994) showed that when $p < 2$, Besov spaces $\mathcal{B}^\alpha_{p,q}$ model sparse signals where, at each level, very few
of the wavelet coefficients are non-negligible. But these coefficients can be very large. When
$p > 2$, $\mathcal{B}^\alpha_{p,q}$-spaces typically model dense signals where the wavelet coefficients are not large but most
of them can be non-negligible. This explains why the size of the support plays a role on minimax rates
when $p > 2$: when the support is larger, the number of wavelet coefficients to be estimated increases
dramatically.

                         $1 \le p \le 2$                $2 \le p \le \infty$
compact support          $n^{-\frac{2\alpha}{2\alpha+1}}$        $n^{-\frac{2\alpha}{2\alpha+1}}$
non compact support      $n^{-\frac{2\alpha}{2\alpha+1}}$        $n^{-\frac{\alpha}{\alpha+1-\frac{1}{p}}}$

Table 1: Minimax rates on $\mathcal{B}^\alpha_{p,q} \cap \mathcal{L}_{2,\infty}$ (up to a logarithmic term) with $1 \le p, q \le \infty$, $\alpha >
\max\left(0, \frac{1}{p}-\frac{1}{2}\right)$ under the $\|\cdot\|_2^2$-loss.



Since arguments for proving Theorems 3 and 4 are similar to the arguments used in Reynaud-Bouret and Rivoirard
(2008), proofs are omitted. We just mention that these results are derived from the oracle inequality
established in Theorem 1.

4 The curse of support from a practical point of view 

Now let us turn to a practical point of view. Is there a curse of support too? First we provide a 
simulation study illustrating the distortion of the most classic support dependent estimators when the 
support or the tail is increasing. Next we provide an application of our method to famous real data 
sets, namely the Suicide data and the Old Faithful geyser data. 

4.1 Simulations 

We compare our method to representative methods of each main trend in density estimation, namely
kernel, binning plus thresholding and model selection. The considered methods are the following. The
first one is the kernel method, denoted K, consisting in a basic cross-validation choice of a global
bandwidth with a Gaussian kernel. The second method requires a complex preprocessing of the data
based on binning. Observations $X_1, \dots, X_n$ are first rescaled and centered by an affine transformation
denoted $T$ such that $T(X_1), \dots, T(X_n)$ lie in [0, 1]. We denote $f_T$ the density of the data induced by
the transformation $T$. We divide the interval [0, 1] into $2^{b_n}$ small intervals of size $2^{-b_n}$, where $b_n$ is an
integer, and count the number of observations in each interval. We apply the root transform due to
Brown et al. (2007) and the universal hard individual thresholding rule on the coefficients computed
with the DWT Coiflet-basis filter. We finally apply the unroot transform to obtain an estimate of $f_T$
and the final estimate of the density is obtained by applying $T^{-1}$ combined with a spline interpolation.
This method is denoted RU. The last method is also support dependent. After rescaling the data as
previously, we estimate $f_T$ by the algorithm of Willett and Nowak (2007). It consists in a complex
selection of a grid and of polynomials on that grid that minimizes a penalized loglikelihood criterion.
The final estimate of the density is obtained by applying $T^{-1}$. This method is denoted WN.
Our practical method is implemented in the Haar basis (method H) and in the Spline basis (method
S) (see Figure 9 in Appendix A for a complete description of this basis). Moreover we have also
implemented the choice $\gamma = 0.5$ of (2.11) in the Spline basis (see Section 2). We denote this method
S*.
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The thresholding rule proposed in Juditsky and Lambert-Lacroix ( 20041 ) has also been considered. 
For their prescribed practical choice of the tuning parameters and the Spline basis, the numerical 
performance is similar to those of method S. Since thresholding is not performed for the coarsest level, 
the approximation term of the reconstruction is based on many non zero negligible coefficients for 
heavy-tailed signals: this leads to obvious numerical difficul ties without significant impact on the risk. 
So, numerical results of the thresholding rule proposed in Juditsky and Lambert-Lacroix ( 20041 ) are 
not given in the sequel. 

We generate $n$-samples of two kinds of densities $f$, with $n = 1024$. Both signals are supported by
the whole real line. We compute for each estimator $\hat f$ the ISE, i.e. $\int_{\mathbb{R}} (\hat f - f)^2$, which is approximated
by a trapezoidal method on a finite interval, adequately chosen so that the remaining term is negligible
with respect to the ISE.

The first signal, $g_d$, consists of a mixture of two Gaussian densities with unit standard deviation:

$$g_d = \tfrac{1}{2}\,\mathcal{N}(0, 1) + \tfrac{1}{2}\,\mathcal{N}(d, 1),$$

where $\mathcal{N}(\mu, \sigma)$ represents the density of a Gaussian variable with mean $\mu$ and standard deviation $\sigma$.
The parameter $d$ varies in $\{10, 30, 50, 70\}$ so that we can see the curse of support on the quality of
estimation.
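
The simulation protocol just described (draw a sample from $g_d$, build an estimate, approximate the ISE by a trapezoidal rule on a finite interval) can be sketched as follows. This is only an illustrative sketch: the fixed-bandwidth Gaussian kernel estimate, the bandwidth 0.4, the integration interval $[-10, d + 10]$ and all function names are assumptions made for the example, not the estimators or settings used in the study.

```python
import numpy as np

def g_d_pdf(x, d):
    """Density of the mixture 0.5*N(0,1) + 0.5*N(d,1) evaluated at x."""
    x = np.asarray(x, dtype=float)
    gauss = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)
    return 0.5 * gauss(x) + 0.5 * gauss(x - d)

def sample_g_d(n, d, rng):
    """Draw n observations from g_d by first picking a mixture component."""
    means = rng.choice(np.array([0.0, float(d)]), size=n)
    return rng.normal(loc=means, scale=1.0)

def gaussian_kernel_estimate(sample, grid, bandwidth):
    """Fixed-bandwidth Gaussian kernel estimate (a stand-in estimator, assumed here)."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (sample.size * bandwidth * np.sqrt(2.0 * np.pi))

def ise_trapezoid(estimate, truth, grid):
    """Trapezoidal approximation of the integral of (estimate - truth)^2 over the grid."""
    sq = (estimate - truth) ** 2
    return float(np.sum(0.5 * (sq[1:] + sq[:-1]) * np.diff(grid)))

rng = np.random.default_rng(0)
d = 10
sample = sample_g_d(1024, d, rng)
# Finite interval chosen so that the mass of g_d outside it is negligible.
grid = np.linspace(-10.0, d + 10.0, 3001)
ise = ise_trapezoid(gaussian_kernel_estimate(sample, grid, 0.4), g_d_pdf(grid, d), grid)
print(f"ISE for d={d}: {ise:.5f}")
```

Repeating the last lines over $d \in \{10, 30, 50, 70\}$ and over independent samples gives the kind of ISE distribution summarized by boxplots.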
Figure 2: Reconstruction of $g_d$ (true: dotted line, estimate: solid line) for the 6 different methods for
$d = 10$



Figure 2 shows the reconstructions for $d = 10$ and Figure 3 for $d = 70$. In the sequel, the method
RU is implemented with $b_n = 5$, which is the best choice for the reconstruction with $d = 10$. All the
methods give satisfying results for $d = 10$. When $d$ is large, the rescaling and binning preprocessing
leads to a poor regression signal which makes the regression thresholding rules unsuitable, as
illustrated by the method RU with $d = 70$. Reconstructions for K, WN, S and S* seem satisfying,
but a study of the ISE of each method (see Figure 4) reveals that both support dependent methods
(RU and WN) have a risk that increases with $d$. On the contrary, methods K and S are the best
ones and, more interestingly, their performance does not vary with $d$. This robustness is also true for
H and S*. S* is a bit undersmoothing: this was already noticed in Figure 1(G) and this explains the
variability of its ISE. Finally note that, for large $d$, H is even better than RU despite the inappropriate
choice of the Haar basis.

Figure 3: Reconstruction of $g_d$ (true: dotted line, estimate: solid line) for the 6 different methods for
$d = 70$

Figure 4: Boxplot of the ISE for $g_d$ over 100 simulations for the 6 methods and the 4 different values
of $d$. A column, delimited by dashed lines, corresponds to one method (respectively K, WN, RU, S,
H, S*). Inside this column, from left to right, one can find for the same method the boxplot of the
ISE for respectively $d = 10$, 30, 50 and 70.

The other signal, hk, is both heavy-tailed and irregular. It consists in a mixture of 4 Gaussian 
densities and one Student density: 

hk = 0.45T(fc) + 0.15 0.05) + 0.1 M(-0.7, 0.005) + 0.25 0.025) + 0.15 AT (2, 0.05), 

where $T(k)$ denotes the density of a Student variable with $k$ degrees of freedom. The parameter $k$
varies in $\{2, 4, 8, 16\}$. The smaller $k$, the heavier the tail, and this without changing the shape of
the main part that has to be estimated. Figure 5 shows the reconstruction for $k = 2$. Clearly RU
does not detect the local spikes at all. Indeed the maximal observation may be equal to 1000 and the
binning effect is disastrous. The kernel method K clearly suffers from a lack of spatial adaptivity, as
expected. The four remaining methods seem satisfying. In particular for this very irregular signal it
is not clear that the Haar basis is a bad choice. Note however that to represent reconstructions, we
have focused on the area where the spikes are located. In particular the support dependent method
WN is non-zero on a very large interval, which tends to deteriorate its ISE. Indeed, Figure 6 shows
that the ISE of the support dependent methods (RU, WN) increases when the tail becomes heavier,
whereas the other methods have remarkably stable ISE. Methods S and H are more robust and better
than WN for $k = 2$. The ISE may be improved for this irregular signal by taking $\gamma = 0.5$ (see method
S*) as already noticed in Section 2 for irregular signals.
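
To make the "binning effect" concrete, the sketch below applies the affine rescaling to $[0, 1]$ and the division into $2^{b}$ bins used by the RU preprocessing to heavy-tailed and light-tailed Student samples. It is a hedged illustration only: pure Student samples are used instead of $h_k$, and the function name `rescale_and_bin` and the value `b = 5` are assumptions made for the example.

```python
import numpy as np

def rescale_and_bin(sample, b):
    """Affinely map the sample onto [0, 1], then count observations in 2**b equal bins."""
    sample = np.asarray(sample, dtype=float)
    lo, hi = sample.min(), sample.max()
    rescaled = (sample - lo) / (hi - lo)
    counts, _ = np.histogram(rescaled, bins=2**b, range=(0.0, 1.0))
    return counts

rng = np.random.default_rng(0)
for k in (2, 16):
    counts = rescale_and_bin(rng.standard_t(df=k, size=1024), b=5)
    # With a heavy tail, a few extreme observations stretch the range, so that
    # most observations tend to fall into very few bins.
    print(f"k={k}: largest bin holds {counts.max()} points, "
          f"{np.count_nonzero(counts)} of {counts.size} bins are non-empty")
```

When nearly all observations share a handful of bins, the local structure of the density is no longer visible to any method applied after binning.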
Figure 5: Reconstruction of $h_k$ (true: dotted line, estimate: solid line) for the 6 different methods for
$k = 2$

Figure 6: Boxplot of the ISE for $h_k$ over 100 simulations for the 6 methods and the 4 different values
of $k$. A column, delimited by dashed lines, corresponds to one method (respectively K, WN, RU,
S, H, S*). Inside this column, from left to right, one can find for the same method the boxplot of
the ISE for respectively $k = 2$, 4, 8 and 16.



4.2 On real data 



To illustrate and evaluate our procedure on real data, we consider two real data sets named,
respectively in our study, "Old Faithful geyser" and "Suicide". The "Old Faithful geyser" data are the
duration, in minutes, of 107 eruptions of the Old Faithful geyser located in Yellowstone National Park,
USA; they are taken from Weisberg (1980). The "Suicide" data set is related to the study of suicide
risks. Indeed, each of the 86 observations corresponds to the number of days a patient, considered
as control in the study, undergoes psychiatric treatment. The data are available in Copas and Fryer
(1980). In both cases, we consider that we have a sample of $n$ real observations $X_1, \ldots, X_n$ and we
want to estimate the underlying density $f$. We mention that in the first situation, all the observations
are continuous whereas, in the second one, the observations are discrete. These data are well known
and have been widely studied elsewhere. This allows us to compare our procedure with other methods.
To estimate the function $f$, we apply f^ rac with the Spline basis (see Figure 9 in Appendix A) and
$j_0 = 7$. We plot, on the same graph, the resulting estimate and the histogram of the data. Figures
7 and 8 represent, respectively, the results for the "Old Faithful geyser" set and for the "Suicide"
one. Note that concerning the "Suicide" data set, there exists a problem of "scale": if we look at the
associated histogram, the scale of the data seems to be approximately equal to 250, and not 1. So we
divide the data by 250 before proceeding to the estimation.

Respectively two or three peaks are detected, providing multimodal reconstructions. So, in comparison
with the ones performed in Silverman (1986) and Sain and Scott (1996), our estimate detects
significant events and not artefacts. More interestingly, both estimates equal zero on an interval located
between the last two peaks. This cannot occur with the Gaussian kernel estimate mentioned
previously. Of course, this has a strong impact for practical purposes, so this point is crucial. This tends
to show that the proposed procedure is relevant for real data, even for relatively small sample sizes.






Figure 7: Histogram (solid line) and reconstruction via f^ rac (dashed line) for the "Old Faithful
geyser" data set

Figure 8: Histogram (solid line) and reconstruction via f^ rac (dashed line) for the "Suicide" data set

A Analytical tools 

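Throughout this paper, we have considered a particular class of wavelet bases that are described now.
We set

$$\phi = 1_{[0,1]}.$$

For any $r > 0$, there exist three functions $\psi$, $\tilde\phi$ and $\tilde\psi$ with the following properties:

1. $\tilde\phi$ and $\tilde\psi$ are compactly supported,

2. $\tilde\phi$ and $\tilde\psi$ belong to $C^{r+1}$, where $C^{r+1}$ denotes the Hölder space of order $r + 1$,

3. $\psi$ is compactly supported and is a piecewise constant function,

4. $\psi$ is orthogonal to polynomials of degree no larger than $r$,

5. $\{(\phi_k, \psi_{jk})_{j \ge 0, k \in \mathbb{Z}}, (\tilde\phi_k, \tilde\psi_{jk})_{j \ge 0, k \in \mathbb{Z}}\}$ is a biorthogonal family: for any $j, j' \ge 0$, for any $k, k'$,

$$\int_{\mathbb{R}} \psi_{jk}(x)\tilde\phi_{k'}(x)\,dx = \int_{\mathbb{R}} \phi_k(x)\tilde\psi_{j'k'}(x)\,dx = 0,$$

$$\int_{\mathbb{R}} \phi_k(x)\tilde\phi_{k'}(x)\,dx = 1_{k=k'}, \qquad \int_{\mathbb{R}} \psi_{jk}(x)\tilde\psi_{j'k'}(x)\,dx = 1_{j=j',\,k=k'},$$

where for any $x \in \mathbb{R}$,

$$\phi_k(x) = \phi(x - k), \qquad \psi_{jk}(x) = 2^{j/2}\psi(2^j x - k)$$

and

$$\tilde\phi_k(x) = \tilde\phi(x - k), \qquad \tilde\psi_{jk}(x) = 2^{j/2}\tilde\psi(2^j x - k).$$

This implies the following wavelet decomposition of $f \in L_2(\mathbb{R})$:

$$f = \sum_{k \in \mathbb{Z}} \alpha_k \tilde\phi_k + \sum_{j \ge 0}\sum_{k \in \mathbb{Z}} \beta_{jk}\tilde\psi_{jk},$$

where for any $j \ge 0$ and any $k \in \mathbb{Z}$,

$$\alpha_k = \int_{\mathbb{R}} f(x)\phi_k(x)\,dx, \qquad \beta_{jk} = \int_{\mathbb{R}} f(x)\psi_{jk}(x)\,dx.$$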



Such biorthogonal wavelet bases have been built by Cohen et al. (1992) as a special case of spline
systems (see also the elegant equivalent construction of Donoho (1994) from boxcar functions). The
Haar basis can be viewed as a particular biorthogonal wavelet basis, by setting $\tilde\phi = \phi$ and
$\tilde\psi = \psi = 1_{[0,1/2)} - 1_{[1/2,1]}$, with $r = 0$ (even if Property 2 is not satisfied with such a choice). The Haar basis is
an orthonormal basis, which is not true for general biorthogonal wavelet bases. However, we have the
frame property: if we denote

$$\Phi = \{\phi, \psi, \tilde\phi, \tilde\psi\},$$

there exist two constants $c_1(\Phi)$ and $c_2(\Phi)$ only depending on $\Phi$ such that

$$c_1(\Phi)\Big(\sum_{k \in \mathbb{Z}} \alpha_k^2 + \sum_{j \ge 0}\sum_{k \in \mathbb{Z}} \beta_{jk}^2\Big) \le \|f\|_2^2 \le c_2(\Phi)\Big(\sum_{k \in \mathbb{Z}} \alpha_k^2 + \sum_{j \ge 0}\sum_{k \in \mathbb{Z}} \beta_{jk}^2\Big).$$

For instance, when the Haar basis is considered, $c_1(\Phi) = c_2(\Phi) = 1$.

We emphasize the important feature of such bases: the functions $\psi_{jk}$ are piecewise constant
functions. For instance, Figure 9 shows an example which is the one that has been implemented for
numerical studies. This allows wavelet coefficients to be computed easily without using the discrete wavelet
transform. In addition, there exists a constant $\mu_\psi > 0$ such that

$$\inf_{x \in [0,1]} |\phi(x)| \ge 1 \quad\text{and}\quad \inf_{x \in \mathrm{Supp}(\psi)} |\psi(x)| \ge \mu_\psi,$$

where $\mathrm{Supp}(\psi) = \{x \in \mathbb{R} : \psi(x) \ne 0\}$.
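
Because the analysis wavelets are piecewise constant, an empirical coefficient reduces to counting observations in intervals. The sketch below illustrates this for the Haar case, assuming the empirical coefficient is the sample mean $\hat\beta_{jk} = \frac{1}{n}\sum_i \psi_{jk}(X_i)$; the function name is hypothetical and the code is an illustration, not the authors' implementation.

```python
import numpy as np

def empirical_haar_coefficient(sample, j, k):
    """Empirical Haar coefficient (1/n) * sum_i psi_jk(X_i), computed by counting.

    psi_jk(x) = 2**(j/2) * psi(2**j * x - k), with psi = +1 on [0, 1/2) and -1 on
    [1/2, 1). The single point x = 1 is ignored, which has probability zero for
    data with a density.
    """
    sample = np.asarray(sample, dtype=float)
    y = 2.0**j * sample - k
    n_plus = np.count_nonzero((y >= 0.0) & (y < 0.5))   # observations in the left half
    n_minus = np.count_nonzero((y >= 0.5) & (y < 1.0))  # observations in the right half
    return 2.0 ** (j / 2.0) * (n_plus - n_minus) / sample.size

data = np.array([0.1, 0.2, 0.3, 0.9])
print(empirical_haar_coefficient(data, j=0, k=0))  # (3 - 1) / 4
print(empirical_haar_coefficient(data, j=1, k=0))  # sqrt(2) * (2 - 1) / 4
```

Only two counts per index $(j, k)$ are needed, so no discrete wavelet transform and no prior binning of the data is involved.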

This technical feature will be used throughout the proofs of our results. To shorten mathematical
expressions, we have previously set for any $k \in \mathbb{Z}$, $\psi_{-1k} = \phi_k$, $\tilde\psi_{-1k} = \tilde\phi_k$ and $\beta_{-1k} = \alpha_k$.

Now, let us give some properties of Besov spaces. Besov spaces, denoted $\mathcal{B}^{\alpha}_{p,q}$, are classically defined
by using modulus of continuity (see DeVore and Lorentz (1993) and Härdle et al. (1998)). We just
recall here the sequential characterization of Besov spaces by using the biorthogonal wavelet basis (for
further details, see Delyon and Juditsky (1997)).

Figure 9: Biorthogonal wavelet basis with $r = 0.272$ that is used in the Simulation study. First line,
$\phi$ (left) and $\psi$ (right), second line $\tilde\phi$ (left) and $\tilde\psi$ (right).



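Let $1 \le p, q \le \infty$ and $0 < \alpha < r + 1$; the $\mathcal{B}^{\alpha}_{p,q}$-norm of $f$ is equivalent to the norm

$$\|f\|_{\alpha,p,q} =
\begin{cases}
\|(\alpha_k)_k\|_{\ell_p} + \Big[\sum_{j \ge 0} 2^{jq(\alpha + \frac12 - \frac1p)} \|(\beta_{jk})_k\|_{\ell_p}^q\Big]^{1/q} & \text{if } q < \infty,\\[4pt]
\|(\alpha_k)_k\|_{\ell_p} + \sup_{j \ge 0} 2^{j(\alpha + \frac12 - \frac1p)} \|(\beta_{jk})_k\|_{\ell_p} & \text{if } q = \infty.
\end{cases}$$

We use this norm to define Besov balls with radius $R$:

$$\mathcal{B}^{\alpha}_{p,q}(R) = \{f \in L_2(\mathbb{R}) : \|f\|_{\alpha,p,q} \le R\}.$$

For any $R > 0$, if $0 < \alpha' \le \alpha < r + 1$, $1 \le p \le p' \le \infty$ and $1 \le q \le q' \le \infty$, we obviously have

$$\mathcal{B}^{\alpha}_{p,q}(R) \subset \mathcal{B}^{\alpha}_{p,q'}(R), \qquad \mathcal{B}^{\alpha}_{p,q}(R) \subset \mathcal{B}^{\alpha'}_{p,q}(R).$$

Moreover

$$\mathcal{B}^{\alpha}_{p,q}(R) \subset \mathcal{B}^{\alpha'}_{p',q}(R) \quad\text{if}\quad \alpha - \frac1p \ge \alpha' - \frac1{p'}.$$

The class of Besov spaces provides a useful tool to classify wavelet decomposed signals with respect to
their regularity and sparsity properties (see Johnstone (1994)). Roughly speaking, regularity increases
when $\alpha$ increases whereas sparsity increases when $p$ decreases.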
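
As an illustration of the sequential characterization, the following sketch evaluates a sequence norm of the form $\|(\alpha_k)_k\|_{\ell_p} + [\sum_{j \ge 0} 2^{jq(\alpha + 1/2 - 1/p)} \|(\beta_{jk})_k\|_{\ell_p}^q]^{1/q}$ (with the usual supremum when $q = \infty$) on finitely many coefficients. The weight $2^{j(\alpha + 1/2 - 1/p)}$ is the standard one and is assumed here; the function names and the toy coefficients are hypothetical.

```python
import numpy as np

def lp_norm(values, p):
    """l_p norm of a finite sequence, with p = inf giving the sup norm."""
    v = np.abs(np.asarray(values, dtype=float))
    if v.size == 0:
        return 0.0
    return float(v.max()) if np.isinf(p) else float((v**p).sum() ** (1.0 / p))

def besov_sequence_norm(alpha_coeffs, beta_levels, alpha, p, q):
    """Sequence norm: beta_levels[j] holds the coefficients (beta_jk)_k of level j >= 0."""
    s = alpha + 0.5 - (0.0 if np.isinf(p) else 1.0 / p)
    terms = [2.0 ** (j * s) * lp_norm(level, p) for j, level in enumerate(beta_levels)]
    if not terms:
        detail = 0.0
    elif np.isinf(q):
        detail = max(terms)
    else:
        detail = sum(t**q for t in terms) ** (1.0 / q)
    return lp_norm(alpha_coeffs, p) + detail

# Toy coefficients: one scaling coefficient and two detail levels.
print(besov_sequence_norm([1.0], [[1.0], [0.5, 0.5]], alpha=1.0, p=2, q=2))
print(besov_sequence_norm([1.0], [[1.0], [0.5, 0.5]], alpha=1.0, p=2, q=np.inf))
```

Lowering $p$ with the same coefficients penalizes dense levels more heavily, which matches the reading of $p$ as a sparsity index.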



B Proofs 

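B.1 Proof of Theorem 1

Because of the frame property of the biorthogonal wavelet basis, it is easy to see that

$$c_1(\Phi)\|\tilde\beta - \beta\|_{\ell_2}^2 \le \|\tilde f_{n,\gamma} - f\|_2^2 \le c_2(\Phi)\|\tilde\beta - \beta\|_{\ell_2}^2, \qquad \text{(B.1)}$$

where $\tilde\beta$ denotes the sequence of thresholded coefficients $(\hat\beta_{jk} 1_{|\hat\beta_{jk}| \ge \eta_{jk,\gamma}} 1_{(j,k) \in \Gamma_n})_{(j,k) \in \Lambda}$ and $\beta$ denotes the true
coefficients $(\beta_{jk})_{(j,k) \in \Lambda}$. Consequently, it is sufficient to restrict ourselves to the study of $\|\tilde\beta - \beta\|_{\ell_2}^2$.
The proof of Theorem 1 then relies on the following result (see Theorem 7 of Section 4.1 in
Reynaud-Bouret and Rivoirard (2008)).

Theorem 5. Let $\Lambda$ be a set of indices. To estimate a countable family $\beta = (\beta_\lambda)_{\lambda \in \Lambda}$ such that
$\|\beta\|_{\ell_2} < \infty$, we assume that a family of coefficient estimators $(\hat\beta_\lambda)_{\lambda \in \Gamma}$, where $\Gamma$ is a known deterministic
subset of $\Lambda$, and a family of possibly random thresholds $(\eta_\lambda)_{\lambda \in \Gamma}$ are available and we consider the
thresholding rule $\tilde\beta = (\hat\beta_\lambda 1_{|\hat\beta_\lambda| \ge \eta_\lambda} 1_{\lambda \in \Gamma})_{\lambda \in \Lambda}$. Let $\varepsilon > 0$ be fixed. Assume that there exist a deterministic
family $(F_\lambda)_{\lambda \in \Gamma}$ and three constants $\kappa \in [0, 1[$, $\omega \in [0, 1]$ and $\mu > 0$ (that may depend on $\varepsilon$ but not on
$\lambda$) with the following properties.

(A1) For all $\lambda \in \Gamma$,

$$P(|\hat\beta_\lambda - \beta_\lambda| > \kappa\eta_\lambda) \le \omega.$$

(A2) There exist $1 < p, q < \infty$ with $\frac1p + \frac1q = 1$ and a constant $R > 0$ such that for all $\lambda \in \Gamma$,

$$\big(E(|\hat\beta_\lambda - \beta_\lambda|^{2p})\big)^{1/p} \le R \max(F_\lambda, F_\lambda^{1/p}\varepsilon^{1/q}).$$

(A3) There exists a constant $\theta$ such that for all $\lambda \in \Gamma$ satisfying $F_\lambda < \theta\varepsilon$,

$$P(|\hat\beta_\lambda - \beta_\lambda| > \kappa\eta_\lambda,\ |\hat\beta_\lambda| > \eta_\lambda) \le F_\lambda\mu.$$

Then the estimator $\tilde\beta$ satisfies

$$\frac{1 - \kappa^2}{1 + \kappa^2}\, E\|\tilde\beta - \beta\|_{\ell_2}^2 \le E \inf_{m \subset \Gamma}\Big\{\frac{1 + \kappa^2}{1 - \kappa^2}\sum_{\lambda \notin m}\beta_\lambda^2 + \frac{1 - \kappa^2}{\kappa^2}\sum_{\lambda \in m}(\hat\beta_\lambda - \beta_\lambda)^2 + \sum_{\lambda \in m}\eta_\lambda^2\Big\} + LD\sum_{\lambda \in \Gamma}F_\lambda$$

with

$$LD = \frac{R}{\kappa^2}\Big((1 + \theta^{-1/q})\,\omega^{1/q} + (1 + \theta^{1/q})\,\varepsilon^{1/q}\mu^{1/q}\Big).$$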
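
The thresholding rule of Theorem 5 keeps an estimated coefficient only when its index belongs to the deterministic set $\Gamma$ and its absolute value reaches the threshold. A minimal sketch, with hypothetical array inputs indexed by $\lambda$ and the convention $|\hat\beta_\lambda| \ge \eta_\lambda$ assumed for the comparison, could read:

```python
import numpy as np

def threshold_coefficients(beta_hat, eta, in_gamma):
    """Return beta_hat * 1{|beta_hat| >= eta} * 1{lambda in Gamma}, entrywise."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    eta = np.asarray(eta, dtype=float)
    in_gamma = np.asarray(in_gamma, dtype=bool)
    keep = in_gamma & (np.abs(beta_hat) >= eta)
    return np.where(keep, beta_hat, 0.0)

beta_hat = [0.5, -0.2, 0.06, 1.0]
eta = [0.1, 0.3, 0.05, 0.2]           # thresholds may differ from one index to another
in_gamma = [True, True, True, False]  # the last index lies outside Gamma
print(threshold_coefficients(beta_hat, eta, in_gamma))
```

Coefficients outside $\Gamma$ are set to zero whatever their size, which is why the oracle inequality contains the bias term over indices not in $m \subset \Gamma$.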



To prove Theorem 1, we use Theorem 5 with $\lambda = (j, k)$, $\hat\beta_\lambda = \hat\beta_{jk}$ defined in (2.4), $\eta_{jk} = \eta_{jk,\gamma}$
defined in (2.5) and

$$\Gamma = \Gamma_n = \{(j, k) \in \Lambda : -1 \le j \le j_0\} \quad\text{with}\quad 2^{j_0} \le n^c(\log n)^{c'} < 2^{j_0 + 1}.$$

We set

$$F_{jk} = \int_{\mathrm{Supp}(\psi_{jk})} f(x)\,dx.$$

Hence we have:

$$\sum_{(j,k) \in \Gamma_n} F_{jk} = \sum_{-1 \le j \le j_0}\sum_k \int_{x \in \mathrm{Supp}(\psi_{jk})} f(x)\,dx \le \int f(x)\sum_{-1 \le j \le j_0}\sum_k 1_{x \in \mathrm{Supp}(\psi_{jk})}\,dx \le (j_0 + 2)\,m_\psi, \qquad \text{(B.2)}$$

where $m_\psi$ is a finite constant depending only on the compactly supported function $\psi$. Finally,
$\sum_{(j,k) \in \Gamma_n} F_{jk}$ is bounded by $\log(n)$ up to a constant that only depends on $c$, $c'$ and the function
$\psi$. Now, we give a fundamental lemma to derive Assumption (A1) of Theorem 5.

Lemma 1. For any $\gamma > 1$ and any $\varepsilon' > 0$ there exists a constant $M$ depending on $\varepsilon'$ and $\gamma$ such that

$$P\big(\sigma_{jk}^2 \ge (1 + \varepsilon')\tilde\sigma_{jk,\gamma}^2\big) \le M n^{-\gamma}.$$

Proof. We have:



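$$\hat\sigma_{jk}^2 = \frac{1}{2n(n-1)}\sum_{i \ne l}\big(\psi_{jk}(X_i) - \psi_{jk}(X_l)\big)^2
= \frac1n\sum_{i=1}^n\big(\psi_{jk}(X_i) - \beta_{jk}\big)^2 - \frac{2}{n(n-1)}\sum_{i=2}^n\sum_{l=1}^{i-1}\big(\psi_{jk}(X_i) - \beta_{jk}\big)\big(\psi_{jk}(X_l) - \beta_{jk}\big)
= s_n - \frac{2}{n(n-1)}\,u_n \qquad \text{(B.3)}$$

with

$$s_n = \frac1n\sum_{i=1}^n\big(\psi_{jk}(X_i) - \beta_{jk}\big)^2 \quad\text{and}\quad u_n = \sum_{i=2}^n\sum_{l=1}^{i-1}\big(\psi_{jk}(X_i) - \beta_{jk}\big)\big(\psi_{jk}(X_l) - \beta_{jk}\big).$$

Using the Bernstein inequality (see section 2.2.3 in Massart (2007)) applied to the variables $Y_i$ with

$$Y_i = \frac{\sigma_{jk}^2 - \big(\psi_{jk}(X_i) - \beta_{jk}\big)^2}{n} \le \frac{\sigma_{jk}^2}{n},$$

one obtains for any $u > 0$,

$$P\Big(\sigma_{jk}^2 \ge s_n + \sqrt{2 v_{jk} u} + \frac{\sigma_{jk}^2 u}{3n}\Big) \le e^{-u},$$

with

$$v_{jk} = \frac1n E\Big[\big(\sigma_{jk}^2 - (\psi_{jk}(X_i) - \beta_{jk})^2\big)^2\Big].$$

We have

$$v_{jk} = \frac1n\Big(\sigma_{jk}^4 + E\big[(\psi_{jk}(X_i) - \beta_{jk})^4\big] - 2\sigma_{jk}^2 E\big[(\psi_{jk}(X_i) - \beta_{jk})^2\big]\Big)
= \frac1n\Big(E\big[(\psi_{jk}(X_i) - \beta_{jk})^4\big] - \sigma_{jk}^4\Big)
\le \frac{\sigma_{jk}^2}{n}\big(\|\psi_{jk}\|_\infty + |\beta_{jk}|\big)^2
\le \frac{4\sigma_{jk}^2}{n}\|\psi_{jk}\|_\infty^2.$$

Finally

$$P\Big(\sigma_{jk}^2 \ge s_n + 2\|\psi_{jk}\|_\infty\sigma_{jk}\sqrt{\frac{2u}{n}} + \frac{\sigma_{jk}^2 u}{3n}\Big) \le e^{-u}. \qquad \text{(B.4)}$$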


Now, we deal with the degenerate U-statistic $u_n$. We use Theorem 3.1 of Houdré and Reynaud-Bouret
(2003) combined with the appropriate choice of constants derived by Klein and Rio (2005): for any
$u > 0$ and any $\tau > 0$,



1+r 



n > (1 + t)CV2^l + 2L>u + -^Fu + y/2(3 + r" 1 ) + - Bu 3/2 + — ,4-u 2 < 3e~" 



3 + r" 1 



(B.5) 
Now we need to define and control the 5 quantities $A$, $B$, $C$, $D$ and $F$. For this purpose, let us set
for any $x$ and $y$,

$$g_{jk}(x, y) = \big(\psi_{jk}(x) - \beta_{jk}\big)\big(\psi_{jk}(y) - \beta_{jk}\big).$$

We have: 



A = \9jk\oo < 



Furthermore, 



The next term is 



n i-l , 

i=2 1=1 



Cn i-l \ 
i=2 1=1 J 
n i—1 

SU P YsYs^^jkiXi) - /3 jfc )a i (X i ))E((^ fc (X0 - jk )bi(Xi)) 

E£a?(*i)<l. EJ2bf(X t )<l i=2 l=1 

n i—1 

< ™p E J^naKmJvinbW)- 

EE«?W)<1,EE^)<1 !=2 |=l V 



So, we have 



D < a 



jk 



sup Yl M a2 (^)) 



E£a?(Xj)<l, EE^ft)<l 



i=2 



i-l 



^E(fe2(^))v^T 
>J i=i 



< a jk sup 

EE<*?(Xi)<l \J i 



E E ( a ?( x *)) 



i=2 



< a 



2 . /n(n-l) 



Still using Theorem 3.1 of Houdré and Reynaud-Bouret (2003), we have:



n-l 



B 1 



sup^E((^fc(t) - jh f(ijj jh {Xi) - /?ifc) S 



l=i 



< 4(n-l)||^ fc ||L4- 

< 4(n-l) 



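Finally

$$F = E\Big(\sup_{i,t}\Big|\sum_{l=1}^{i-1} g_{jk}(t, X_l)\Big|\Big) \le 2\|\psi_{jk}\|_\infty\, E\Big(\sup_i\Big|\sum_{l=1}^{i-1}\big(\psi_{jk}(X_l) - \beta_{jk}\big)\Big|\Big).$$

To control this term, we set

$$Z_i = \sum_{l=1}^{i-1}\big(\psi_{jk}(X_l) - \beta_{jk}\big).$$

Applying Lemma 1 of Devroye and Lugosi (2001), for any $s > 0$, for any $i \le n$,

$$E\big(e^{sZ_i}\big) \le e^{s^2(i-1)(2\|\psi_{jk}\|_\infty)^2/8} \le e^{s^2(n-1)\|\psi_{jk}\|_\infty^2/2}.$$

Similarly,

$$E\big(e^{-sZ_i}\big) \le e^{s^2(n-1)\|\psi_{jk}\|_\infty^2/2}.$$

Hence, by Lemma 2.2 of Devroye and Lugosi (2001),

$$E\Big(\sup_i\Big|\sum_{l=1}^{i-1}\big(\psi_{jk}(X_l) - \beta_{jk}\big)\Big|\Big) \le \|\psi_{jk}\|_\infty\sqrt{2(n-1)\log(2n)}.$$

Hence

$$F \le 2\sqrt{2}\,\|\psi_{jk}\|_\infty^2\sqrt{(n-1)\log(2n)}.$$

Now, for any $u > 0$, let us set

$$S(u) = 2\|\psi_{jk}\|_\infty\sigma_{jk}\sqrt{\frac{2u}{n}} + \frac{\sigma_{jk}^2 u}{3n}$$

and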



£/(u) = (1 + r)CV2^ + 2L>u + + ( y/2(3 + r" 1 ) + - ) Bu 3 / 2 + - - ' Au 2 . 



3 + r- 1 



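Inequalities (B.4) and (B.5) give

$$P\Big(\sigma_{jk}^2 \ge \hat\sigma_{jk}^2 + S(u) + \frac{2}{n(n-1)}U(u)\Big) = P\Big(\sigma_{jk}^2 \ge s_n + S(u) + \frac{2}{n(n-1)}\big(U(u) - u_n\big)\Big)$$

$$\le P\big(\sigma_{jk}^2 \ge s_n + S(u)\big) + P\big(u_n \ge U(u)\big)$$

$$\le 4e^{-u}.$$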



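Let us take $u = \gamma\log n$ and $\tau = 1$. Then, there exist some constants $a$ and $b$ depending on $\gamma$ such that

$$S(u) + \frac{2}{n(n-1)}U(u) \le 2\sigma_{jk}\|\psi_{jk}\|_\infty\sqrt{2\gamma\frac{\log n}{n}} + a\,\sigma_{jk}^2\frac{\log n}{n} + b\,\|\psi_{jk}\|_\infty^2\Big(\frac{\log n}{n}\Big)^{3/2}.$$

So,

$$P\Big(\sigma_{jk}^2 \ge \hat\sigma_{jk}^2 + 2\sigma_{jk}\|\psi_{jk}\|_\infty\sqrt{2\gamma\frac{\log n}{n}} + a\,\sigma_{jk}^2\frac{\log n}{n} + b\,\|\psi_{jk}\|_\infty^2\Big(\frac{\log n}{n}\Big)^{3/2}\Big) \le 4n^{-\gamma}$$

and

$$P\Big(\sigma_{jk}^2\Big(1 - a\frac{\log n}{n}\Big) - 2\sigma_{jk}\|\psi_{jk}\|_\infty\sqrt{2\gamma\frac{\log n}{n}} - \hat\sigma_{jk}^2 - b\,\|\psi_{jk}\|_\infty^2\Big(\frac{\log n}{n}\Big)^{3/2} \ge 0\Big) \le 4n^{-\gamma}.$$

Now, we set

$$\theta_1 = 1 - a\frac{\log n}{n}, \qquad \theta_2 = \|\psi_{jk}\|_\infty\sqrt{2\gamma\frac{\log n}{n}}$$

and

$$\theta_3 = \hat\sigma_{jk}^2 + b\,\|\psi_{jk}\|_\infty^2\Big(\frac{\log n}{n}\Big)^{3/2},$$

with $\theta_1, \theta_2, \theta_3 > 0$ for $n$ large enough depending only on $\gamma$. We study the polynomial

$$p(\sigma) = \theta_1\sigma^2 - 2\theta_2\sigma - \theta_3.$$

Then, since $\sigma \ge 0$, $p(\sigma) \ge 0$ means that

$$\sigma \ge \frac{1}{\theta_1}\Big(\theta_2 + \sqrt{\theta_2^2 + \theta_1\theta_3}\Big),$$

which is equivalent to

$$\sigma^2 \ge \frac{1}{\theta_1^2}\Big(2\theta_2^2 + \theta_1\theta_3 + 2\theta_2\sqrt{\theta_2^2 + \theta_1\theta_3}\Big).$$

Hence

$$P\Big(\sigma_{jk}^2 \ge \frac{1}{\theta_1^2}\Big(2\theta_2^2 + \theta_1\theta_3 + 2\theta_2\sqrt{\theta_2^2 + \theta_1\theta_3}\Big)\Big) \le 4n^{-\gamma}.$$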



20 2 V03" , 40f 



So, there exist absolute constants $\delta$, $\eta$ and $\tau'$ depending only on $\gamma$ so that for $n$ large enough,



P > ffj fc (l + ^) + (l + ^ ) 2||fe|UA/27^^ + fryhMSo^ I 1 + ( ^ ) I I < 4 "" 7 - 
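Hence, with

$$\tilde\sigma_{jk,\gamma}^2 = \hat\sigma_{jk}^2 + 2\|\psi_{jk}\|_\infty\sqrt{2\gamma\,\hat\sigma_{jk}^2\frac{\log n}{n}} + 8\gamma\|\psi_{jk}\|_\infty^2\frac{\log n}{n},$$

for all $\varepsilon' > 0$ there exists $M$ such that

$$P\big(\sigma_{jk}^2 \ge (1 + \varepsilon')\tilde\sigma_{jk,\gamma}^2\big) \le M n^{-\gamma}.$$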
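
For intuition on the size of the resulting data-driven threshold, the sketch below computes $\tilde\sigma^2_{jk,\gamma}$ and then a threshold of the form $\sqrt{2\gamma\tilde\sigma^2_{jk,\gamma}\log n/n} + 2\gamma\|\psi_{jk}\|_\infty\log n/(3n)$. This form of the threshold is an assumption inferred from the terms appearing in the proof (the defining equation (2.5) is not reproduced in this appendix), and the function names are hypothetical.

```python
import math

def sigma_tilde_sq(sigma_hat_sq, psi_sup, gamma, n):
    """Inflated variance: sigma_hat^2 + 2*psi_sup*sqrt(2*gamma*sigma_hat^2*r) + 8*gamma*psi_sup^2*r,
    where r = log(n)/n and psi_sup stands for the sup norm of psi_jk."""
    r = math.log(n) / n
    return (sigma_hat_sq
            + 2.0 * psi_sup * math.sqrt(2.0 * gamma * sigma_hat_sq * r)
            + 8.0 * gamma * psi_sup**2 * r)

def threshold(sigma_hat_sq, psi_sup, gamma, n):
    """Assumed threshold: sqrt(2*gamma*sigma_tilde^2*r) + 2*gamma*psi_sup*r/3, with r = log(n)/n."""
    r = math.log(n) / n
    return (math.sqrt(2.0 * gamma * sigma_tilde_sq(sigma_hat_sq, psi_sup, gamma, n) * r)
            + 2.0 * gamma * psi_sup * r / 3.0)

n, gamma, psi_sup = 1024, 1.5, 2.0
# With a zero empirical variance, the threshold equals (14*gamma/3) * psi_sup * log(n)/n.
print(threshold(0.0, psi_sup, gamma, n), 14.0 * gamma / 3.0 * psi_sup * math.log(n) / n)
print(threshold(0.25, psi_sup, gamma, n))
```

Under this assumed form, the lower bound $\frac{14\gamma}{3}\|\psi_{jk}\|_\infty\frac{\log n}{n}$ on the threshold follows by keeping only the last term of $\tilde\sigma^2_{jk,\gamma}$.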



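Let $\kappa < 1$. Applying the previous lemma gives

$$P\big(|\hat\beta_{jk} - \beta_{jk}| > \kappa\eta_{jk,\gamma}\big) \le P\Big(|\hat\beta_{jk} - \beta_{jk}| \ge \sqrt{2\kappa^2\gamma\tilde\sigma_{jk,\gamma}^2\frac{\log n}{n}} + \frac{2\kappa\gamma\log n\,\|\psi_{jk}\|_\infty}{3n},\ \sigma_{jk}^2 \ge (1 + \varepsilon')\tilde\sigma_{jk,\gamma}^2\Big)$$

$$+\ P\Big(|\hat\beta_{jk} - \beta_{jk}| \ge \sqrt{2\kappa^2\gamma\tilde\sigma_{jk,\gamma}^2\frac{\log n}{n}} + \frac{2\kappa\gamma\log n\,\|\psi_{jk}\|_\infty}{3n},\ \sigma_{jk}^2 < (1 + \varepsilon')\tilde\sigma_{jk,\gamma}^2\Big)$$

$$\le P\big(\sigma_{jk}^2 \ge (1 + \varepsilon')\tilde\sigma_{jk,\gamma}^2\big) + P\Big(|\hat\beta_{jk} - \beta_{jk}| \ge \sqrt{2\kappa^2\gamma(1 + \varepsilon')^{-1}\sigma_{jk}^2\frac{\log n}{n}} + \frac{2\kappa\gamma\log n\,\|\psi_{jk}\|_\infty}{3n}\Big).$$

Using again the Bernstein inequality, we have for any $u > 0$,

$$P\Big(|\hat\beta_{jk} - \beta_{jk}| \ge \sqrt{\frac{2u\sigma_{jk}^2}{n}} + \frac{2u\|\psi_{jk}\|_\infty}{3n}\Big) \le 2e^{-u}.$$

So, with $\varepsilon' = 1 - \kappa$, there exists a constant $M_\kappa$ depending only on $\kappa$ and $\gamma$ such that

$$P\big(|\hat\beta_{jk} - \beta_{jk}| > \kappa\eta_{jk,\gamma}\big) \le M_\kappa n^{-\gamma\kappa^2/(2-\kappa)}.$$

So, for any value of $\kappa \in [0, 1[$, Assumption (A1) is true with $\eta_{jk} = \eta_{jk,\gamma}$ if we take $\omega = M_\kappa n^{-\gamma\kappa^2/(2-\kappa)}$.
Now, to prove (A2), we use the Rosenthal inequality. There exists a constant $C(p)$ only depending on
$p$ such that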


1 



E 



< 

< 

< 
< 



n 2p 

C( P ) 

n 2p 



2p 



i=l 

n 



2p 



' n 



+ (5^Var(^ fc (Xi)) 



n 



ife 00 J 



\2p-2 



Var(^(X;)) + £Var(^(X,)) 



vi=l 



n 2p 

C( P ) 

n 2p 



((2|^|oo) 2p nF ifc + ^|^ fc |^)- 



Finally, 



< 



4C(p)^ m ax(|<^;|Ve; 



So, Assumption (A2) is satisfied with e = ^ and 



i? 



8C(p)^°max(|<^;||^) 



Finally, to prove Assumption (A3), we use the following lemma.

Lemma 2. We set

$$N_{jk} = \sum_{i=1}^n 1_{X_i \in \mathrm{Supp}(\psi_{jk})} \quad\text{and}\quad C' = \frac{14\gamma}{3} \ge \frac{14}{3}.$$

There exists an absolute constant $0 < \theta' < 1$ such that if $nF_{jk} \le \theta' C'\log n$ and $(1 - \theta')\log n \ge \frac37$ then,

$$P\big(N_{jk} - nF_{jk} \ge (1 - \theta')C'\log n\big) \le F_{jk}n^{-\gamma}.$$

Proof. One takes $\theta' \in [0, 1]$ such that

$$\frac{(1 - \theta')^2}{2\theta' + 1} \ge \frac47.$$

We use the Bernstein inequality that yields

$$P\big(N_{jk} - nF_{jk} \ge (1 - \theta')C'\log n\big) \le \exp\Big(-\frac{\big((1 - \theta')C'\log n\big)^2}{2\big(nF_{jk} + (1 - \theta')C'\log n/3\big)}\Big) \le n^{-\frac{3(1 - \theta')^2 C'}{2(2\theta' + 1)}}.$$

If $nF_{jk} \ge n^{-\gamma - 1}$, since $\frac{3(1 - \theta')^2 C'}{2(2\theta' + 1)} \ge 2\gamma + 2$, the result is true. If $nF_{jk} \le n^{-\gamma - 1}$, using properties of
Binomial random variables (see page 482 of Shorack and Wellner (1986)), for $n \ge 2$,

P(N jk - nF jk > (1 - 0')C"log re) < P(iV jfc > (1 - 0')C"log n ) < P (jv ifc > 2) 



< 



< 



;i - F jk )ClFj k {\ - F jk 
l-3- l {n + l)F jk 



2(1 - 2~ 1 nF. 



jk) 



< (nF, 



jk) 



and the result is true. 
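Now, observe that if $|\hat\beta_{jk}| \ge \eta_{jk,\gamma}$ then

$$N_{jk} \ge C'\log n.$$

Indeed, $|\hat\beta_{jk}| \ge \eta_{jk,\gamma}$ implies

$$\frac{C'\log n}{n}\|\psi_{jk}\|_\infty \le |\hat\beta_{jk}| \le \frac{\|\psi_{jk}\|_\infty N_{jk}}{n}.$$

So, if $n$ satisfies $(1 - \theta')\log n \ge \frac37$, we set $\theta = \theta' C'\log(n)$ and $\mu = n^{-\gamma}$. In this case, Assumption (A3)
is fulfilled since if $nF_{jk} \le \theta' C'\log n$,

$$P\big(|\hat\beta_{jk} - \beta_{jk}| > \kappa\eta_{jk,\gamma},\ |\hat\beta_{jk}| > \eta_{jk,\gamma}\big) \le P\big(N_{jk} - nF_{jk} \ge (1 - \theta')C'\log n\big) \le F_{jk}n^{-\gamma}.$$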
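Finally, if $n$ satisfies $(1 - \theta')\log n \ge \frac37$, we can apply Theorem 5 and we have:

$$\frac{1 - \kappa^2}{1 + \kappa^2}\,E\|\tilde\beta - \beta\|_{\ell_2}^2 \le E\inf_{m \subset \Gamma_n}\Big\{\frac{1 + \kappa^2}{1 - \kappa^2}\sum_{(j,k) \notin m}\beta_{jk}^2 + \frac{1 - \kappa^2}{\kappa^2}\sum_{(j,k) \in m}(\hat\beta_{jk} - \beta_{jk})^2 + \sum_{(j,k) \in m}\eta_{jk,\gamma}^2\Big\} + LD\sum_{(j,k) \in \Gamma_n}F_{jk}. \qquad \text{(B.6)}$$

In addition, there exists a constant $K_1$ depending on $p$, $\gamma$, $\kappa$, $c$, $c'$ and on $\psi$ such that

$$LD\sum_{(j,k) \in \Gamma_n}F_{jk} \le K_1(\log(n))^{c' + 1}\,n^{c - \frac{\kappa^2\gamma}{q(2 - \kappa)} - 1}. \qquad \text{(B.7)}$$

Since $\gamma > c$, one takes $\kappa < 1$ and $q > 1$ such that $c < \frac{\kappa^2\gamma}{q(2 - \kappa)}$ and, as required by Theorem 1, the last
term satisfies

$$LD\sum_{(j,k) \in \Gamma_n}F_{jk} \le \frac{K_2}{n},$$

where $K_2$ is a constant. Now we can derive the oracle inequality. Before evaluating the first term of
(B.6), let us state the following lemma.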
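Lemma 3. We set for any $(j, k) \in \Lambda$

$$D_{jk} = \int \psi_{jk}^2(x)f(x)\,dx,$$

$$S_\psi = \max\Big\{\sup_{x \in \mathrm{Supp}(\phi)}|\phi(x)|,\ \sup_{x \in \mathrm{Supp}(\psi)}|\psi(x)|\Big\}$$

and

$$I_\psi = \min\Big\{\inf_{x \in \mathrm{Supp}(\phi)}|\phi(x)|,\ \inf_{x \in \mathrm{Supp}(\psi)}|\psi(x)|\Big\}.$$

Using Appendix A, we define $\Theta_\psi = S_\psi^2/I_\psi^2$. For all $(j, k) \in \Lambda$, we have the following result.

- If $F_{jk} \le \Theta_\psi\frac{\log(n)}{n}$, then $\beta_{jk}^2 \le \Theta_\psi^2 D_{jk}\frac{\log(n)}{n}$.

- If $F_{jk} > \Theta_\psi\frac{\log(n)}{n}$, then $\|\psi_{jk}\|_\infty\frac{\log(n)}{n} \le \sqrt{D_{jk}\frac{\log(n)}{n}}$.

Proof. We assume that $j \ge 0$ (arguments are similar for $j = -1$).
If $F_{jk} \le \Theta_\psi\frac{\log(n)}{n}$, we have

$$|\beta_{jk}| \le S_\psi 2^{j/2}F_{jk} \le S_\psi 2^{j/2}\sqrt{F_{jk}}\sqrt{\Theta_\psi\frac{\log(n)}{n}} \le S_\psi I_\psi^{-1}\sqrt{\Theta_\psi}\sqrt{D_{jk}\frac{\log(n)}{n}} \le \Theta_\psi\sqrt{D_{jk}\frac{\log(n)}{n}}$$

since $D_{jk} \ge I_\psi^2 2^j F_{jk}$. For the second point, observe that

$$\sqrt{D_{jk}\frac{\log(n)}{n}} \ge 2^{j/2}I_\psi\sqrt{F_{jk}\frac{\log(n)}{n}} > 2^{j/2}I_\psi\sqrt{\Theta_\psi}\,\frac{\log(n)}{n} = 2^{j/2}S_\psi\frac{\log(n)}{n} \ge \|\psi_{jk}\|_\infty\frac{\log(n)}{n}.$$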

Now, for any 5 > 0, 



tw 2 n ^ /-, . n 2 7 logn 2 ! ^27logn\ 

E(r/ ifei7 ) < (1 + 5) — - — E(o-- fc ) + (1 + S )' 



3n 



y ifc|| 



Moreover, 
So, 



D 



< (1 + <5)^ + (1 + ^ 1 )8 7 logn 



n 



n 



v- 



D 



E(»& J < (1 + 5) a 2 7 logn-^ + A(<5) 



7log n 



n \ n 

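with $\Delta(\delta)$ a constant depending only on $\delta$. Now, we apply (B.6) with

$$m = \Big\{(j, k) \in \Gamma_n : \beta_{jk}^2 > \Theta_\psi^2 D_{jk}\frac{\log n}{n}\Big\}, \qquad \text{(B.8)}$$

so using Lemma 3 we can claim that for any $(j, k) \in m$, $F_{jk} > \Theta_\psi\frac{\log(n)}{n}$. Finally since $\Theta_\psi \ge 1$,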



np-P\\i < k 3 [ e ^ 1 { ^< e ^io g n. } + £ * 



(j,fe)6r„ 



log n ( log n 
F>j k + 



+ 



^4 



n 



< K 3 



< 2K 3 



n. u +21ogn^l 
n 



E min(/3| fc ,G>gn^)+ £ /3; 



{^>ejiog«^} 



+ E * 



+ 



(i,fe)er 



+ 



K 4 



n 
where the constant $K_3$ depends on $\gamma$ and $c$, and $K_4$ depends on $\gamma$, $c$, $c'$ and on $\psi$. Finally, since

$$D_{jk} = \sigma_{jk}^2 + \beta_{jk}^2,$$



E||/3 



II f 2 



< 2K* 



< 2K* 



(j,fe)er„ 



(j.*)^r„ 



+ 



A" 1 



n 



e|logn 



^ min U*,eJlogn-^ + £ + E & 



+ 



A 4 



< 2A^G 



3^ 



min 

(j,fc)er„ 



0}k,**n-f + E * 



+ 2^1/31^ + f. 



Theorem 1 is proved by using properties of the biorthogonal wavelet basis.



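B.2 Proof of Theorem 2

The first part is a direct application of Theorem 1. Now let us turn to the second part. We recall that
we consider $f = 1_{[0,1]}$, the Haar basis and for $j \ge 0$ and $k \in \mathbb{Z}$, we have:

$$\tilde\sigma_{jk,\gamma}^2 = \hat\sigma_{jk}^2 + 2\|\psi_{jk}\|_\infty\sqrt{2\gamma\,\hat\sigma_{jk}^2\frac{\log n}{n}} + 8\gamma\|\psi_{jk}\|_\infty^2\frac{\log n}{n}.$$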



So, for any < e < Kf- < L 



2 logn , x 



0% < (1 + e)u% + 27||<MlL^F + 4 ) • 



Now, 



Vjk,-y 



2ja 



2 logrc 



+ 



3 7logn 



3n 



< W2 7 



logn 



(l+e)a2 fc + 27|^ fc | |2 ^ ^ 



loen 



n 



n 

ool^ogn ( 1 
n ~ V 3 



Furthermore, using (B.3),



and 



J n(n — 1) 



•ih, 



D 7log n 



3?i 



V. log n I log n 2 
2 7 (1 + e)^s n + 1/27(1 + x — u n + 

n W n re(n — 1J 

Using (B.5), with probability larger than $1 - 6n^{-2}$,

$$|u_n| \le U(2\log n),$$



D 7log n / 1 



n 



(I + v7+l^) 
and, since $f = 1_{[0,1]}$, we have $\sigma_{jk}^2 \le 1$ and



n[n — 1) re \ n 



where $C_1$ and $C_2$ are universal constants. Finally, with probability larger than $1 - 6n^{-2}$,



y n nyn — 1) n \ n J 

So, since $\gamma < 1$, there exists $w(\varepsilon)$, only depending on $\varepsilon$, such that with probability larger than $1 - 6n^{-2}$,



, jk!l < W2 7 (l + £ )^ Sn + ^( £) - ■ l0gn 



n - 1100 n 



Since $\|\psi_{jk}\|_\infty = 2^{j/2}$, we set



logn 22 logn 



n n 



and f?jfe j7 < ??jA;,7 with probability larger than 1 — 6n 2 . Then, since / = l[o,i]> Pjk = for j > and 

1 " 

n i=i 

2 J 2 

/ , ( 1 X i G[fc2-i,(fc+0.5)2-J[ - 1 X i G[(fc+0.5)2-i,(fe+l)2-J[) 



n . 
1=1 



with 



N jk ~ E 1 ^ I 6[fc2-3,(fc+0.5)2-J[ 5 ^jfe - X] 1 ^ ! e[(fc+0.5)2-3,(fc+l)2-J[- 
i=l i=l 

We consider $j$ such that

$$\frac{n}{(\log n)^{\alpha}} \le 2^j < \frac{2n}{(\log n)^{\alpha}}, \qquad \alpha > 1.$$

In particular, we have

$$\frac{(\log n)^{\alpha}}{2} < n2^{-j} \le (\log n)^{\alpha}.$$



Now, 

1 n 22 



n » — ' re 

i=l 
Hence,



2^-1 



E(||/n, T -/||) > T, E (^h 1 



k=0 
2i-l 



\>Vjk,-, 



> E E 



fc=0 



/fc 1 |^|>r 7 ^ 7 1 K»|<C/(21ogr l ) 



2 J 



k=0 



\0 jk \>^j{l+ S )s n ^+w(e)^^ 1 |«n|<C/(21ogn) y ) • 



fc=0 



2-? 



j 

2? 



|iv+ -iVr, |>^2 7 (l +£ ) f (iV+ +JV- & ) 
- E ^2 E (^j* - ^ 2 V+fc-^^I^)^ 
" ^ E " ^7l) 2 V; i -^l>y2 7 (l +£ )( J v; i+ 7V-)lo g „ + ^( £ )lo g „ 1 l^l^ l/ ( 21 °S«)) • 



Now, we consider a bounded sequence (w n ) n such that for any n, w n > w(e) and such that is an 
integer with 



47(1 + e)fi n j log(n) + w n login] 



and /i n j is the largest integer smaller or equal to n2 J 1 . We have 



v n j ~ 47(1 + e)jl n j logra 



and 



So, if 



then 



(log n) ( 



1 < n2~ j ~ 1 - 1 < fl nj < n2" i ' 1 < 



1 ^ (logn) 



^y^l — P"nj ~\~ „ \J v nj 1 — An? 9 \J v nj ■, 



27(1 + e) ( N+ + NT X ) log n + w n log n. 



Finally, 

2 2j / 

E(|/n, 7 -/|i) > -^v nj ¥\N+ = fi nj + - 

> v n ,(logny 2a 

N n = 

> v n] {\ogn)- 2a 



2 finj 1 — jlnj 2* 



1 



P 



1 

2 V 



Vnj, \u n \ < U(2logn) 



\un\ > l7(21ogn)) 



l n j\ni n j\{n l n j Tn n j)\ 



(1 - 2p-) n ~( lnj+mn ^ 
with

and 

So, 



<"nj — l^nj ~i~ 2 V ' ^nnj — f^nj ^ y^nj j 

P? = / t [k2-^ ! (k+o.5)2-J[(x)f{x)dx = j l[( k +o.5)2-^{k+ip-J[(x)f{x)dx = 2~ J ~ 1 . 



H\\fn,j ~ /111) > Mlogn)- 2 * x 
Now, let us study each term: 



V 



2fi n 



(1 - 2 Pj ) n - 2 ^ 



711 



l n j\m nj \(n - 2jl nj )\ 



pf n3 (l-2 Pj ) n - 2 ^ 



n- 



■- exp (2fi nj log(pj)) 

: exp(2A nj log(2^ 1 )) , 

exp ((n - 2/2 nj -) log(l - 2py)) 

exp (-(n - 2/2 nj ) (2-^ + On(2^ 2j ))) 

exp(-n2^') (l + o n (l)), 



n; = n 



n e- n V2^(l + o n (l)), 



Then, 



(n - 2}i nj ) n - 2 ^ 



exp ((n - 2/i n j) log (n - 2/2 nj -)) 
exp ( (n - 2/2 nj -) ( log n + log ( 1 



2/2. 



<<./ 



exp ( (n - 2/2 nj -) logn 



2pL n j (n - 2% 



n 



exp (n log n - 2p, nj - 2jl nj log n) (1 + o n (l)). 



pf" j (l-2 Pj 



(n - 2/i n ,)! 3 



,n-2/i„ 



n 



xpf n *(l-2 Pj ) n - 2 ^ x (1 + „(1)) 



e n " (n - 2fL nj ) n - 2 ^ ^ 

exp [n log n — z/i n j — lfi n j log raj 
exp (2fL nj logn + log^'" 1 ) - n2~ j ) (1 + o n (l)). 



It remains to evaluate l n j\ x m n j\ 



lni\ n] (m nj \ 
e 



yj2irl n jy/2irm n j{l + o n {\)) 



exp (7 nj - log/ n j + m nj logm nj - 2fi nj ) x 2irfi n j(l + o n (l)). 



If we set 



Xnj 



2fi 



On(l), 
then



lnj — P"nj ~i~ ~ — fj"nj ( 1 ~i~ %nj ) ; 



m 



nj — fJ"nj 



and using that 



(l + x nj )log(l + x n j) = (1 + 



fl n j{l X n j)i 



X^ x 3 
Xn 3 2 ' 3 ^ ®( Xn j) 



xt „• xz 



X 



•"nj ^n] 2 ray . p.i 4 

nj 7) 1 5 X nj q r ^A^, 



2 3 



l n j log l nj = fl n j (I + X nj ) log (fi n j(l+X nj )) 

= f~L n j(l + X n j) log(l + Z n j) + /U nJ (l + X nJ ) log (f~L n j) 



xi A x 3 



= finj [Xnj + ~Y ~ + O(X^) + /i nj (l + X ni ) log (/in^ 



Similarly, 



So, 



Since 



2 S 



m nj log m ni = /i nj j -x nj + + + O(x^) J + /i nj (l - x nj -) log (/i^ 



l nj log + m n j log m nj = jinj (x nj + 0(x nj )) + 2/} ni log (fi n j) 

< flnjX nj + 2/2 ni log(n2~ i ~ 1 ) + 0(fi n jX n j). 



^njX n j — 



2 _ °nj 



4// 



7(1 + £■) logn, 



for n large enough, 
and 

Finally, 

lnj ■ ^ Tlhij ■ 



Mnj^nj + 0{Jl n jX n j) < (7 + 2e) log 



7i 



Z„j log Z n j + m n j log m n j < (7 + 2e) log n + 2fi nj log(n2 J x ). 

= exp (l n j logl n j + m nj - log m n j - 2p, n j) X 2^/2.^(1 + o n (l)) 

< exp((7 + 2e)logn + 2/i ni log(n2~ : ''~ 1 ) - 2ft nj ) x 2irp, n j(l + o n (l)). 



we derive that 

E(||/ re , 7 - /II) > ^(logn)- 2a x 
> w ni (logn)- 2a x 



n! 



p^(l-2 Pi ) n - 2 ^ -4 



Ijjj \m n j ! (ti 2jl n j)\ 



> v nj (logn) 



nr 2a x 



exp (2/2 nj -logn + 2/2 nj - log (2 J ' x ) - ?i2 J ) 

exp ((7 + 2e) log n + 2fi nj log(n2-J- 1 ) - 2//^-) x 2-Kfi n j 

exp (—(7 + 2e) log n — 2) 6 



277/2, 



(1 + 0.(1)) 
So there exist two positive constants $C_1$ and $C_2$ such that, for $n$ large enough,

n -(t +2£ ) 6 



E(||/n, 7 -/||i)>Ci(logn 



l-Q 



Co 



(log n) 



n- 



As $0 < \gamma + 2\varepsilon < 1$, there exists a positive constant $\delta < 1$ such that

$$E\big(\|\tilde f_{n,\gamma} - f\|_2^2\big) \ge \frac{1}{n^{\delta}}\big(1 + o_n(1)\big).$$

This concludes the proof of Theorem 2.